Vector-matrix multiplication

ABSTRACT

An integrated VMM (vector-matrix multiplier) module, including an electro-optical VMM component that multiplies an input vector by a matrix to produce an output vector; and an electronic VPU (vector processing unit) that processes at least one of the input and output vectors. Various error reducing mechanisms are also discussed.

RELATED APPLICATIONS

The present application is a U.S. national application of PCT Application No. PCT/IL02/00727, filed Sep. 3, 2002, the disclosure of which is incorporated herein by reference.

FIELD OF THE INVENTION

The present invention is related to applications using vector-matrix multiplication.

BACKGROUND OF THE INVENTION

In the quest for higher processing power, various hardware architectures have been proposed, including digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) and general CPUs. As might be expected, however, even the fastest processors are not fast enough for the newest real-time applications that are conceived by system designers. Typically, any algorithm used is optimized to take into account the characteristics of the particular hardware/software implementation. A typical optimization method is to not use available information to its fullest, thus conserving computing time and remaining within the limits of the hardware capabilities, trading off performance for quality.

Cellular telephone systems are well known. One of the new system concepts is the UMTS (Universal Mobile Telecommunication System) architecture, also known as the 3G architecture. The currently suggested standards are 3GPP1 for UMTS and 3GPP2 for a different concept known as CDMA2000. Both concepts define various protocols for implementing high data rate digital communications using a WBCDMA (Wide Band Code Division Multiple Access) method.

A considerable amount of signal processing is required to implement the algorithms defined by the suggested standards, especially in the base station, where signals from multiple users, all broadcasting at the same time and frequency, must be detected and analyzed. The standard solution is to optimize the algorithms for execution on DSPs or ASICs. However, even after such optimization, available processing power is not sufficient for the task, and many of the protocols are either not implemented in a complete manner (e.g., ignoring some available information and trading off performance for quality) or are implemented on a multi-card device, with the users distributed between multiple costly cards.

A processor architecture referred to as the “Stanford optical VMM”, described for example in Dror G. Feitelson, “Optical Computing”, Chapter 4.3, MIT Press 1988, the disclosure of which is incorporated herein by reference, suggests performing vector matrix multiplication (VMM) using an optical model based on a transparency matrix. An analog electronic vector-matrix multiplication unit is described, for example, in “Programmable Analog Vector-Matrix Multipliers”, by F. Kub, K. Moon, I. Mack and F. Long, in IEEE Journal of Solid-State Circuits, vol. 25 (1), pp. 207-214, 1990, which is incorporated herein by reference.

U.S. Pat. Nos. 4,937,776, 5,448,749 and 5,321,639 apparently describe architectures including optical components, which are suggested for use in matrix/vector manipulation.

SUMMARY OF THE INVENTION

A broad aspect of some embodiments of the invention relates to a processor (referred to herein as a VMM processor) including a vector-matrix multiplier (VMM) core, adapted to perform vector-matrix multiplication. The VMM core is optionally implemented using an electro-optical architecture.

In some embodiments of the invention, the processor including the VMM core is a self-sufficient VMM processor, optionally provided as a single module, which includes at least one other processing unit that is adapted to perform operations that the VMM core is not optimally adapted to perform. The at least one other processing unit optionally includes a vector processing unit (VPU) and/or a scalar processing unit (e.g., a DSP). Potentially, such integration of a VMM core and at least one other processing unit allows faster operation and/or reduces interference.

Alternatively or additionally, the self-sufficiency of the processor is characterized by the ability of the module to store and utilize local results and/or the ability to reconfigure itself (e.g., calculate and/or change the matrix values). Thus, an independently functioning module is provided in some embodiments, and there are fewer inter-device communications and fewer slowdown problems when the module is integrated into a complete system.

Further alternatively or additionally, the processor includes one or more memory units. The memory units may be used, for example, for storing matrix data, intermediate results, old results, static data, various parameters and/or micro-code or sequencing instructions for the module and/or parts thereof.

Optionally, the VMM processor includes a controller. In an exemplary embodiment of the invention, the controller sequences the operation of the module and/or the acceptance of input and/or transmission of output, for example, sequencing a series of interspersed transform operations, VPU operations, matrix changes and DSP operations. In a particular example, the controller buffers vector input while a matrix is being replaced.

In some embodiments of the invention, the VMM processor is reconfigurable for various applications. Alternatively or additionally, in a particular application, the module is reconfigurable between several operating modes so that a dynamic resource allocation algorithm may be utilized. In a particular example, a module that is configured to perform two operations may be reconfigured to perform only one operation, or to achieve a higher accuracy from one operation at the expense of the other operation. In another example, the same module is used for two or more different functions, by changing the matrix and/or the steps performed by the VPU.

In some embodiments of the invention, the VMM processor is used for applications in which the bulk (e.g., in number and/or in scalar processing steps) of the operations performed belongs to a VMM task.

Optionally, the VMM core includes a matrix of transparency or reflectance elements, whose level of transparency or reflectance (referred to herein as attenuation) represents a respective mathematical matrix value. Input vector values are optionally converted into light beams, which are directed to elements of the matrix to perform the multiplication. For algorithms that use a significant number of vector-matrix multiplication steps, performing the multiplication on a VMM core can achieve much higher processing speeds than are achievable using prior art processor architectures.
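The multiplication performed by such a core can be modeled numerically in a few lines. The following is a minimal sketch, assuming non-negative source intensities and attenuation values restricted to [0, 1]; the function name `optical_vmm` is hypothetical. Signed or larger mathematical matrix values would have to be mapped onto this range by pre- and post-processing.

```python
# Sketch of an ideal optical VMM core (assumed model, not the patented device):
# x[i] is the non-negative intensity of light source i, T[i][j] is the
# attenuation (transmittance) of matrix element (i, j) in [0, 1], and
# detector j sums all light passing through column j.

def optical_vmm(intensities, attenuation):
    """Return detector readings y[j] = sum_i x[i] * T[i][j]."""
    if len(attenuation) != len(intensities):
        raise ValueError("the matrix needs one row per light source")
    n_out = len(attenuation[0])
    if any(len(row) != n_out for row in attenuation):
        raise ValueError("all matrix rows must have the same length")
    if any(x < 0 for x in intensities):
        raise ValueError("light intensities cannot be negative")
    if any(not 0.0 <= t <= 1.0 for row in attenuation for t in row):
        raise ValueError("attenuation values must lie in [0, 1]")
    return [sum(x * row[j] for x, row in zip(intensities, attenuation))
            for j in range(n_out)]


x = [1.0, 0.5, 0.25]            # source intensities
T = [[1.0, 0.0],                # element attenuations
     [0.5, 1.0],
     [0.0, 0.5]]
print(optical_vmm(x, T))        # [1.25, 0.625]
```

Each detector performs the summation "for free" by collecting light, which is why the whole product is obtained in a single optical pass.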

In some embodiments of the invention, the attenuation values of the matrix elements are changeable. Alternatively, the VMM processor allows update of only a part of the matrix, for example, only one or more rows, columns, rectangular blocks and/or any other portions. In some embodiments of the invention, a matrix portion is updated while the rest of the matrix is operational. In one example, the matrix includes one or more redundant rows which may be used instead of other rows of the matrix. Optionally, the values of the redundant rows may be replaced while the remaining matrix elements are used in performing vector-matrix multiplication.

In some embodiments of the invention, a single VMM processor includes a plurality of VMM cores which operate in parallel and/or pipelined, on the same or different vectors. Alternatively or additionally, the VMM core performs a sequence of a plurality of VMM operations on a single input vector.

An aspect of some embodiments of the invention relates to a VMM sub-system which includes a VMM core. The VMM sub-system additionally includes a pre-processing unit and/or a post-processing unit, which are used to enhance the accuracy of the VMM sub-system, to compensate for defective and/or otherwise imperfect hardware of the VMM core, to reduce cross-talk, and/or to improve the signal to noise ratio.

Optionally, the VMM core comprises an electro-optical unit which performs optical multiplication. The pre-processing and/or post-processing are optionally performed in electronic digital form. Alternatively or additionally, the pre- and/or post-processing may be used for signal processing purposes.

In some embodiments of the invention, the pre-processing and/or post-processing include scrambling the data and/or changing the range of the data, in order to reduce errors due to non-linearities and/or the limited dynamic range of the VMM core. Alternatively or additionally, the input vector values are adjusted to compensate for inaccurate spreading of light beams, which impinge on neighboring matrix elements and/or detectors in addition to the matrix elements and/or detectors to which they are directed. Optionally, the pre-processing and/or post-processing depend on calibration tests performed periodically. Alternatively or additionally, the pre-processing and/or post-processing are performed according to a predetermined model.
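As an illustration of adjusting input values to compensate for beam spreading, the sketch below assumes a hypothetical predetermined model in which the beam aimed at row `i` also delivers a fraction `eps` of its light to each immediate neighbor. Under that assumption the pre-processor can solve for the inputs to send by a simple Jacobi iteration; the names `spread` and `precompensate` are illustrative only.

```python
# Hypothetical cross-talk model: the beam aimed at row i also delivers a
# fraction eps of its light to each immediate neighbour (rows i-1 and i+1).

def spread(x, eps):
    """Light actually arriving at each matrix row under the assumed model."""
    n = len(x)
    return [x[i] + eps * ((x[i - 1] if i > 0 else 0.0) +
                          (x[i + 1] if i < n - 1 else 0.0))
            for i in range(n)]


def precompensate(target, eps, iterations=60):
    """Find inputs x with spread(x, eps) == target by Jacobi iteration.

    The iteration matrix has spectral radius below 2 * eps, so it
    converges whenever 0 <= eps < 0.5."""
    if not 0.0 <= eps < 0.5:
        raise ValueError("the sketch assumes 0 <= eps < 0.5")
    n = len(target)
    x = list(target)
    for _ in range(iterations):
        # Every new value is computed from the previous iterate (Jacobi).
        x = [target[i] - eps * ((x[i - 1] if i > 0 else 0.0) +
                                (x[i + 1] if i < n - 1 else 0.0))
             for i in range(n)]
    return x


wanted = [1.0, 0.2, 0.8, 0.5]
sent = precompensate(wanted, eps=0.1)
print([round(v, 6) for v in spread(sent, 0.1)])   # ≈ wanted
```

Note that pre-compensated values can become negative (for example, a dark element between two bright ones), which a light source cannot emit; in practice this would be combined with the baseline shifting discussed further below.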

In some embodiments of the invention, a single mathematical vector-matrix multiplication involves performing a plurality of VMM core operations, in order to enhance the accuracy of the result. Optionally, the input vector is partitioned into a plurality of bit planes or groups of bit planes, and each bit-plane group is multiplied separately by the matrix. The bit-plane group results are thereafter combined. Alternatively or additionally, a VMM core operation is performed on the same input data a plurality of times, and a final result is derived as an average of the results of the plurality of operations.
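The bit-plane partitioning can be sketched as follows, assuming unsigned integer inputs. Here `matvec` stands in for one ideal core pass, and both function names are hypothetical; passing a noisy `core` function instead would show how the reduced per-pass input range interacts with the core's limited accuracy.

```python
def matvec(x, m):
    """Exact product y[j] = sum_i x[i] * m[i][j], standing in for one core pass."""
    return [sum(xi * row[j] for xi, row in zip(x, m)) for j in range(len(m[0]))]


def bitplane_vmm(x, m, bits, group=1, core=matvec):
    """Multiply an unsigned integer vector by m, one bit-plane group per pass.

    Each pass sends only `group` bits of every input element through the
    core, so the core sees values in [0, 2**group) instead of [0, 2**bits).
    The partial products are recombined with the matching powers of two."""
    if any(not 0 <= v < (1 << bits) for v in x):
        raise ValueError("inputs must be unsigned integers below 2**bits")
    mask = (1 << group) - 1
    result = [0] * len(m[0])
    for shift in range(0, bits, group):
        plane = [(v >> shift) & mask for v in x]   # one bit-plane group
        partial = core(plane, m)                   # one VMM core operation
        result = [r + p * (1 << shift) for r, p in zip(result, partial)]
    return result


x = [200, 3, 77]
m = [[1, 2], [3, 4], [5, 6]]
print(bitplane_vmm(x, m, bits=8, group=2))   # [594, 874]
print(matvec(x, m))                          # [594, 874]
```

Larger groups need fewer passes but expose the core to a wider input range; `group=1` is the extreme case of one pass per bit.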

Alternatively or additionally to performing a plurality of VMM core operations for a single calculation, when an input vector smaller than the matrix of the VMM core is multiplied, the unused elements of the matrix are used for redundant multiplication and/or for bit-plane partitioning. In an exemplary embodiment of the invention, when the input vector is smaller than the size of the matrix, the extra capacity of the VMM core is optionally utilized to enhance the performance of the VMM core. Optionally, at least some of the matrix values are duplicated so that they are represented by a plurality of matrix elements. The corresponding input values are optionally directed through the plurality of elements and the resulting output values are averaged. This averaging increases the accuracy of the vector-matrix multiplication.
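A minimal sketch of this duplication-and-averaging idea, under two stated assumptions: the copies of the matrix occupy disjoint blocks of core rows and columns (so each copy reaches its own detectors), and every detector adds independent Gaussian noise. The names `noisy_pass` and `duplicated_vmm` are hypothetical.

```python
import random


def noisy_pass(x, m, sigma, rng):
    """One core pass with independent Gaussian noise on every detector (assumed model)."""
    return [sum(xi * row[j] for xi, row in zip(x, m)) + rng.gauss(0.0, sigma)
            for j in range(len(m[0]))]


def duplicated_vmm(x, m, core_inputs, sigma=0.0, seed=0):
    """Use spare core capacity for redundant copies and average their outputs.

    Assumes the copies of m occupy disjoint blocks of core rows and columns,
    so each copy reaches its own detectors and its noise is independent."""
    copies = core_inputs // len(x)
    if copies < 1:
        raise ValueError("the input vector does not fit in the core")
    rng = random.Random(seed)
    runs = [noisy_pass(x, m, sigma, rng) for _ in range(copies)]
    return [sum(column) / copies for column in zip(*runs)]


# A 2-element vector on a 64-input core leaves room for 32 copies.
estimate = duplicated_vmm([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]],
                          core_inputs=64, sigma=0.05)
```

With `k` independent copies, the standard deviation of each averaged output is reduced by a factor of sqrt(k) relative to a single pass, at no cost in core passes.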

An aspect of some embodiments of the invention relates to a VMM implementation that includes redundant components, available for compensating for damaged components. For example, a VMM implementation may include redundant input, output and/or matrix elements. In an exemplary embodiment of the invention, the input and/or output are routed through operable elements, in order to take advantage of the redundancy. Optionally, the redundant components are managed by the VMM processor in a manner that is transparent to an external host of the VMM processor. Alternatively or additionally, the quality of the different VMM components is monitored, so that better quality components may be used for computations that require a higher accuracy.

In some embodiments of the invention, the VMM core includes one or more redundant elements, such as light detectors, light sources and/or matrix elements. The pre-processing and/or post-processing optionally includes selecting the elements to be used for a specific processing session.

An aspect of some embodiments of the invention relates to a physical implementation of an optical VMM core, in which the light sources generating the input vector values and/or the detectors generating the results are organized in a two-dimensional array. Accordingly, each mathematical matrix row or column is represented by a two-dimensional array of matrix elements, arranged, for example, in a square or a circle.

An aspect of some embodiments of the invention relates to a physical implementation of an optical VMM core including a plurality of matrices. Optionally, the plurality of matrices are each smaller than the mathematical matrix they represent. Alternatively or additionally, each of the plurality of matrices represents a portion of the represented mathematical matrix. Using smaller matrices allows easier production processes and/or production with a higher yield.

The light from the plurality of matrices is optionally led to different respective detector arrays. In some embodiments of the invention, the light is led from the matrices to the detectors on non-overlapping (although possibly crossing) light paths. For example, the matrices may be aligned relative to each other at an angle different from 90°. Alternatively or additionally, the light from different matrices directed toward the detectors is in different respective polarizations.

In some embodiments of the invention, the plurality of matrices are used in an implementation including a polarizing beam splitter (PBS), in order to receive substantially all the light from the light sources when the polarization from the light sources is not known (uncontrolled). Optionally, the plurality of matrices includes two matrices which have the same element values. The beam splitter splits the generated light beams between the two matrices according to the uncontrolled polarization. Since two identical matrices are used, the polarization does not matter, and light not received by the first matrix is handled by the second matrix.

An aspect of some embodiments of the invention relates to compensation for polarization artifacts of a light source. In an exemplary embodiment of the invention, an array of VCSELs is used; however, each VCSEL has a previously unknown, possibly different polarization. A polarization sensitive (optionally reflective) SLM is possibly used. In an exemplary embodiment of the invention, the light from each VCSEL is split by a polarizing beam splitter, so that each component of light hits a suitably oriented (polarization-wise) SLM. The resulting processed beams of light are overlapped, for example, by the same beam splitter, and detected by a detector. If the two SLMs are controlled in corresponding manners, the same result is expected independent of the original polarization, or even if the polarization changes over time.
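Why identical SLM settings cancel the unknown polarization can be seen from a short intensity-only model. The sketch assumes an ideal, lossless PBS that sends a fraction `p` of each source's light to SLM A and `1 - p` to SLM B, with both modulated beams landing on the same detector column; `dual_slm_detect` is a hypothetical name.

```python
def dual_slm_detect(intensities, pol_fraction, slm_a, slm_b):
    """Detector readings when a PBS splits each beam between two SLMs.

    pol_fraction[i] is the (unknown) share of source i's light in the
    polarization routed to SLM A; the rest goes to SLM B. The two modulated
    beams are recombined onto the same detector column (assumed model)."""
    n_out = len(slm_a[0])
    return [sum(I * (p * a[j] + (1.0 - p) * b[j])
                for I, p, a, b in zip(intensities, pol_fraction, slm_a, slm_b))
            for j in range(n_out)]


T = [[0.9, 0.1], [0.4, 0.7]]
x = [1.0, 2.0]
r1 = dual_slm_detect(x, [0.2, 0.9], T, T)   # identical SLM settings
r2 = dual_slm_detect(x, [0.7, 0.1], T, T)   # different, unknown polarizations
# r1 and r2 agree up to rounding: each reads sum_i x[i] * T[i][j]
```

Since `p*t + (1 - p)*t = t`, the reading depends only on the common SLM setting; if the two SLMs were programmed differently, the result would again depend on `p`.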

A broad aspect of some embodiments of the invention relates to using a VMM architecture for various applications where, heretofore, VMM methods were not used. In particular embodiments of the invention, in applications for which algorithms were optimized for serial or parallel architectures, the algorithms are, in accordance with some embodiments of the present invention, optimized for VMM architectures. In some applications, algorithms that are naturally (mathematically) of a VMM type and were previously optimized for serial implementation on electronic circuits are now un-optimized. However, as will be noted below, optically realized VMM architectures have a potential speed that is apparently and currently much higher than that of electronics (for similar sizes and/or power loads), allowing some brute force algorithms to be applied with reasonable cost, so the previous optimization is no longer ideal. In some embodiments of the invention, however, new optimization for VMM and/or VPU operation may be practiced instead. Particular applications include cellular telephone signal processing, such as a WBCDMA receiver at a base station, and face recognition for automated cameras. Exemplary additional applications include the XDSL (Digital Subscriber Line) family of wire modems, OFDM (Orthogonal Frequency Division Multiplexing) technology, GSM, EDGE (2.5G) and other cellular communication systems, VDB Wireless Broadcast, networking applications such as packet processors, routers and switches, compression and decompression protocols such as JPEG, MPEG, MP3 and CELP/LPC voice, spectrum analyzers and/or machine vision systems, such as correlation engines.

A broad aspect of some embodiments of the invention relates to the implementation and/or modification of various 3G or other cellular telephone system algorithms, such as “smart antenna”, to apply VMM operations, for example, on a VMM architecture and/or a vector or scalar architecture. A particular benefit of some VMM architectures is that robust real-time processing may be provided. In an exemplary embodiment of the invention, a limited accuracy architecture, such as a non-digital optical architecture, is used. While errors due to limited accuracy are generally undesirable, in the case of WBCDMA the arriving data is error laden, and the algorithms are sufficiently robust to work suitably even if their implementation is not perfect. Another exemplary such application is decoding bits by correlation of a measured input string of values with a reference code. The accuracy of each individual multiplication and of the sum is often not that critical if, for example, the sum is tested against a threshold, so small errors can be tolerated. After the bits are detected (e.g., by the above method), an error correction protocol is optionally applied (e.g., digitally). In some embodiments of the invention, it is assumed that a large fraction of the bits are wrong (mainly due to radio interference), so computation errors will at most add a few more wrong bits.
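The tolerance of threshold-based correlation decoding to limited accuracy can be illustrated with a small sketch. It assumes a single user, chip-aligned input and a ±1 spreading code; the code values, the additive interference pattern and the uniform 10% matrix error are illustrative only, and the function names are hypothetical.

```python
def spread_bits(bits, code):
    """Map bits {0,1} to symbols {-1,+1} and multiply each by the spreading code."""
    return [(1 if b else -1) * c for b in bits for c in code]


def correlation_matrix(code, n_bits):
    """Column b holds the reference code at the chip positions of bit b."""
    sf = len(code)
    return [[code[r % sf] if r // sf == b else 0 for b in range(n_bits)]
            for r in range(sf * n_bits)]


def decode(chips, matrix, threshold=0.0):
    """One VMM pass computes all correlations; each is then thresholded."""
    corr = [sum(c * row[b] for c, row in zip(chips, matrix))
            for b in range(len(matrix[0]))]
    return [1 if v > threshold else 0 for v in corr]


code = [1, -1, 1, 1, -1, -1, 1, -1]
bits = [1, 0, 1, 1]
received = [c + 0.6 * (-1) ** r for r, c in enumerate(spread_bits(bits, code))]
# Emulate limited accuracy: every matrix element is realized 10% too small.
coarse = [[0.9 * v for v in row] for row in correlation_matrix(code, len(bits))]
print(decode(received, coarse))   # [1, 0, 1, 1]
```

Both the interference and the matrix error shift the correlation values, but not across the zero threshold, so the decoded bits are unchanged.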

In an exemplary embodiment of the invention, a VMM architecture is used in a rake receiver. In one example, the VMM is used for searching for signal paths. Alternatively or additionally, the VMM architecture is used for tracking signal paths. Alternatively or additionally, the VMM architecture is used for implementing finger decoders. It should be noted that a fast VMM implementation (which is generally attainable in some embodiments of the invention) can provide multiple fingers for each user, for example, 4 or 8 fingers.

In an exemplary embodiment of the invention, a VMM architecture is used to implement a Multi User Detection (MUD) algorithm. In an exemplary embodiment of the invention, the MUD algorithm is implemented as a parallel algorithm, so that the interference from a large plurality of path signals and/or user signals is removed at each iteration, rather than only one at a time. In one implementation, the signals are removed prior to determining the delay for each signal.

In an exemplary embodiment of the invention, when part of an interfering signal cannot be estimated with confidence, the signal part is not subtracted out, or is fractionally subtracted, for example, a value of half a bit instead of a value of a whole bit.
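The parallel removal of interference and the fractional subtraction can be combined in one stage of what is commonly called parallel interference cancellation (PIC). The sketch below is a stand-in, not the method of the document: it assumes known user amplitudes, measures confidence as |correlation| / amplitude, and uses hypothetical thresholds of 0.75 (subtract a whole symbol) and 0.25 (subtract half a symbol).

```python
def correlate(r, signatures):
    """Normalized correlation of r with every signature (one VMM pass)."""
    n = len(r)
    return [sum(rk * sk for rk, sk in zip(r, s)) / n for s in signatures]


def subtraction_weight(confidence, high=0.75, low=0.25):
    """Assumed rule: subtract a whole symbol, half a symbol, or nothing."""
    if confidence >= high:
        return 1.0
    if confidence >= low:
        return 0.5
    return 0.0


def parallel_cancellation_stage(r, signatures, amplitudes):
    """Estimate all users at once, then remove every other user's estimate."""
    z = correlate(r, signatures)
    estimates = []
    for zu, amp, s in zip(z, amplitudes, signatures):
        sign = 1.0 if zu >= 0 else -1.0
        weight = subtraction_weight(abs(zu) / amp)
        estimates.append([weight * sign * amp * c for c in s])
    total = [sum(col) for col in zip(*estimates)]
    decisions = []
    for est, s in zip(estimates, signatures):
        # Remove all estimates except the user's own one.
        cleaned = [rk - tk + ek for rk, tk, ek in zip(r, total, est)]
        decisions.append(1 if correlate(cleaned, [s])[0] >= 0 else -1)
    return decisions


sigs = [[1, 1, 1, 1], [1, 1, 1, -1]]           # non-orthogonal signatures
amps = [1.0, 3.0]                              # user 1 weak, user 2 strong
r = [-1 * a + 3 * b for a, b in zip(*sigs)]    # user 1 sends -1, user 2 sends +1
print([1 if v >= 0 else -1 for v in correlate(r, sigs)])   # [1, 1]: user 1 wrong
print(parallel_cancellation_stage(r, sigs, amps))           # [-1, 1]
```

In this example the weak user's first estimate is both wrong and of low confidence, so only half of it is subtracted from the strong user's signal, limiting the damage a wrong decision can cause.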

In an exemplary embodiment of the invention, a VMM architecture is used for decoding signals having variable spreading factors, for example, to decode high bandwidth channels (e.g., a fast Internet data service) mixed with low bandwidth channels (e.g., a voice call).

In an exemplary embodiment of the invention, a VMM architecture implements a smart uplink antenna, in which the signals from a plurality (e.g., 2, 4, 10 or more) of antenna elements are processed together to yield an effectively narrow-angle receiving antenna. Optionally, a smart antenna is used to separate high interference users, such as high data rate users, while other users are separated out using other methods, such as MUD.
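Forming several narrow receive beams at once is itself a vector-matrix product: the antenna samples form the vector and each matrix column holds the weights of one beam. The sketch assumes a uniform line array with half-wavelength spacing and conventional (phase-conjugate) beam weights; it uses complex arithmetic, which an optical core would have to realize as several real-valued products. Function names are hypothetical.

```python
import cmath
import math


def steering(n, theta_deg):
    """Array response of an n-element line array with half-wavelength spacing."""
    phase = math.pi * math.sin(math.radians(theta_deg))
    return [cmath.exp(1j * phase * k) for k in range(n)]


def beam_matrix(n, angles):
    """Column b holds the conjugate steering weights for beam direction angles[b]."""
    columns = [[w.conjugate() for w in steering(n, a)] for a in angles]
    return [[col[k] for col in columns] for k in range(n)]


def beam_powers(samples, matrix):
    """One (complex) vector-matrix product forms every beam at once."""
    outputs = [sum(s * row[b] for s, row in zip(samples, matrix))
               for b in range(len(matrix[0]))]
    return [abs(o) ** 2 for o in outputs]


angles = [-60, -30, 0, 30, 60]
snapshot = steering(8, 30)            # one plane wave arriving from +30 degrees
powers = beam_powers(snapshot, beam_matrix(8, angles))
print(angles[powers.index(max(powers))])   # 30
```

The matched beam adds the eight element signals coherently (power 64 for unit amplitude), while the other beams add them with mismatched phases.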

In an exemplary embodiment of the invention, a VMM architecture is used to implement a smart downlink antenna. In an exemplary embodiment of the invention, a secondary spreading sequence sent to the target telephone (e.g., by sending different sequences for different lobes of the antenna) is analyzed (once returned) to determine a desirable transmit direction path to use for sending data to the target telephone.

An aspect of some embodiments of the invention relates to a detection method for detecting in parallel two signals that have a temporal offset. In an exemplary embodiment of the invention, the correlation matrix includes, for a single user, at least two signals, each signal including contributions from at least two consecutive bits. One signal represents two same-valued bits and the other represents two alternating-valued bits. If a larger number of bits is detected, a larger number of different signals may be required. The partitioning point between the bits is dependent on the delay for the user (or path), which can be known, for example, based on tracking or on detection of pilot bits. Optionally, the detection is applied on an input vector portion having a size of at least two bits, for example, to assist in detecting correlation in cases where the partitioning point is near the end of the input vector. Consecutive correlations may utilize overlapping input vector portions. The detection method may be used, for example, for MUD or for other detection methods. In a VMM implementation, a simple correlation may be used. However, if a DSP is used (as well as or instead of the VMM), more advanced correlation/detection methods may be used, for example, weighted correlation. For example, if a first detection is “01” and a second detection is “00”, there is a disagreement on the middle bit. A decision between the two correlations may be made, for example, based on the quality of each correlation and/or the length of the bit included in the corresponding input vector portion.
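The two composite signals can be sketched for a window of one code period that straddles a bit boundary. The layout is an assumption made for illustration: `delay` is taken to be the number of chips of the earlier bit at the start of the window, and the ±1 code is arbitrary. The two correlations correspond to a two-column VMM.

```python
def composite_templates(code, delay):
    """Two reference signals for a window straddling a bit boundary.

    Assumed layout: the window is one code period long; its first `delay`
    chips carry the end of bit k and the remaining chips the start of bit k+1."""
    sf = len(code)
    if not 0 < delay < sf:
        raise ValueError("delay must fall strictly inside the code period")
    tail = code[sf - delay:]      # last chips of the earlier bit
    head = code[:sf - delay]      # first chips of the later bit
    same = tail + head                         # bits k, k+1 equal
    alternating = tail + [-c for c in head]    # bits k, k+1 differ
    return same, alternating


def detect_bit_pair(window, code, delay):
    """Correlate with both templates (one two-column VMM) and pick the stronger."""
    same, alternating = composite_templates(code, delay)
    c_same = sum(w * t for w, t in zip(window, same))
    c_alt = sum(w * t for w, t in zip(window, alternating))
    if abs(c_same) >= abs(c_alt):
        first = 1 if c_same >= 0 else -1
        return first, first
    first = 1 if c_alt >= 0 else -1
    return first, -first


code = [1, -1, 1, 1, -1, 1, -1, -1]
delay = 3
tail, head = code[-delay:], code[:-delay]
window = [1 * c for c in tail] + [-1 * c for c in head]   # bits +1 then -1
print(detect_bit_pair(window, code, delay))   # (1, -1)
```

The matching template yields the full correlation, while the other template partly cancels, so the larger magnitude selects the hypothesis and its sign gives the first bit.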

An aspect of some embodiments of the invention relates to using parallel vector processing for processing data at a cellular base station.

An aspect of some embodiments of the invention relates to compensating for an inexact implementation of a transform method. In an exemplary embodiment of the invention, a transform is implemented by convolution of a vector with a matrix of sub-elements. In a calibration step, the inexact realization of the transform is determined, so that the realization can be corrected by modifying the matrix such that applying the modified matrix yields, as a result of the imperfection, the desired transform. Alternatively or additionally, the correction is applied after the VMM operation, for example, by a VPU. In an exemplary embodiment of the invention, by pre-correcting a slowly changing matrix, correction of a frequently changing result is avoided.
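A minimal sketch of matrix pre-correction, assuming the calibration step yields a per-element gain and offset so that a loaded value `m` is realized as `g*m + o`. This element-wise model is a hypothetical simplification; the inverse map is then applied once per matrix load rather than once per result.

```python
def realize(matrix, gain, offset):
    """Assumed calibration model: element (i, j) is realized as g*m + o."""
    return [[g * m + o for m, g, o in zip(mr, gr, orow)]
            for mr, gr, orow in zip(matrix, gain, offset)]


def precorrect(desired, gain, offset):
    """Values to load so that the imperfect device realizes `desired`."""
    return [[(d - o) / g for d, g, o in zip(dr, gr, orow)]
            for dr, gr, orow in zip(desired, gain, offset)]


desired = [[0.5, 0.25], [0.75, 0.1]]
gain = [[0.9, 1.1], [0.95, 1.0]]       # measured in a calibration step
offset = [[0.02, 0.0], [-0.01, 0.03]]
loaded = precorrect(desired, gain, offset)
achieved = realize(loaded, gain, offset)   # equals desired up to rounding
```

A real device would also have to check that every pre-corrected value still falls inside its programmable range; the sketch omits that check.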

There is thus provided in accordance with an exemplary embodiment of the invention, an integrated VMM (vector-matrix multiplier) module, comprising:

an electro-optical VMM component that multiplies an input vector by a matrix to produce an output vector; and

an electronic VPU (vector processing unit) that processes at least one of the input and output vectors. Optionally, the module comprises a DSP (digital signal processor) that processes at least one of the input vector, the output vector or the matrix. Alternatively or additionally, the module comprises a memory that stores at least one of matrix replacement values, at least one previous output vector and instructions for a component of the module.

In an exemplary embodiment of the invention, at least one of said DSP and VPU is programmed to calculate an update value for at least part of said matrix.

In an exemplary embodiment of the invention, said VMM component includes a local memory buffer for update values of said matrix.

In an exemplary embodiment of the invention, said VMM module comprises a register file adapted for exchanging information between said VMM and said VPU. Optionally, said register file includes a register copy ability for transferring information between registers.

In an exemplary embodiment of the invention, said module comprises a parameter extractor which extracts at least one parameter from at least one of said vectors. Optionally, said parameter comprises an extreme value element.

In an exemplary embodiment of the invention, said VMM module comprises a pre-processor which preprocesses said input vector, to improve a quality of its processing by a matrix component of said VMM.

In an exemplary embodiment of the invention, said VMM module comprises a pre-processor which preprocesses said input vector, to correct for artifacts caused by processing by a matrix component of said VMM.

In an exemplary embodiment of the invention, said VMM module comprises a vector buffer for buffered input from an external circuit. Optionally, said buffer receives 8 bit data in parallel.

There is also provided in accordance with an exemplary embodiment of the invention, an integrated VMM (vector-matrix multiplier) module, comprising:

an electro-optical VMM component that multiplies an input vector by a matrix to produce an output vector; and

a controller, wherein said controller is operative to replace values in only a part of said matrix.

There is also provided in accordance with an exemplary embodiment of the invention, a VMM (vector-matrix multiplier) component, comprising:

a plurality of input elements that represent an input vector;

a plurality of electro-optical matrix elements that represent a transformation matrix; and

a plurality of detector elements that detect signals from said input elements after they are modulated by said matrix elements, such that said detected signals represent a result vector of a vector-matrix multiplication of said input vector,

wherein said component comprises at least one redundant element in at least one of said input, matrix and output elements. Optionally, said component comprises a fan element for at least one of input fanning and output fanning, wherein said fan element is programmable to selectively utilize said at least one redundant element. Optionally, said component comprises a controller which manages said at least one redundant element.

There is also provided in accordance with an exemplary embodiment of the invention, a VMM component, comprising:

an array of sources;

at least one optical element which spreads the light from one source into a two-dimensional beam;

an SLM having logical rows arranged in a two-dimensional manner to match said beam; and

a detector which detects the contributions of modulation of said SLM for multiple beams,

wherein said array is a two-dimensional array representing a one-dimensional array. Optionally, said SLM is reflective.

There is also provided in accordance with an exemplary embodiment of the invention, a VMM component, comprising:

an array of sources having imperfect polarization orientation;

at least one lens which spreads the light into a beam;

a beam splitter which splits said beam into first and second polarization components;

at least a first SLM that modulates said first polarization component of said beam;

at least a second SLM that modulates said second polarization component of said beam; and

a detector array which detects the contributions of modulation of said SLMs for multiple beams from both polarization components. Optionally, said beam splitter is a polarizing beam splitter that combines said modulated beams. Alternatively, different elements of said array detect beams from different polarizations.

In an exemplary embodiment of the invention, said detector array comprises a polarizing beam combiner for combining said polarization components.

In an exemplary embodiment of the invention, said SLMs are perpendicular to each other. Alternatively, said SLMs are not perpendicular to each other.

There is also provided in accordance with an exemplary embodiment of the invention, an optical vector matrix multiplier (VMM), comprising:

an array of light sources, adapted to generate light beams representing a multiplied vector;

at least two reflective matrixes adapted to spatially modulate light from the light sources; and

a detector array adapted to detect light from the reflective matrixes. Optionally, said VMM comprises a beam splitter adapted to receive light generated by the array of light sources and direct the received light to one or more of the matrixes. Optionally, the beam splitter comprises a polarization beam splitter. Alternatively or additionally, the beam splitter provides each of the matrixes with a predetermined percentage of the light of each of the generated light beams. Alternatively, the amount of light provided by the beam splitter to the matrixes is not predetermined.

In an exemplary embodiment of the invention, at least one of the matrixes has fewer elements than the number of light sources multiplied by the number of detectors. Alternatively or additionally, at least some of the elements of the matrixes represent values of a mathematical matrix, and the elements of at least one of the matrixes represent fewer than all the elements of the mathematical matrix.

There is also provided in accordance with an exemplary embodiment of the invention, a method of improving signal detection in an electro-optical VMM, comprising:

receiving an input vector and a matrix to be processed by said VMM; and

rearranging said input vector on an input of said VMM and said matrix in a matrix portion of said VMM, in a manner that improves signal detection. Optionally, rearranging comprises spatially separating vector elements to reduce cross-talk. Alternatively or additionally, rearranging comprises duplicating at least some vector elements. Optionally, rearranging comprises duplicating an entire vector.
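Spatial separation of vector elements can be illustrated with the same kind of hypothetical cross-talk model used for beam spreading (each source also lights its neighboring rows with a fraction `eps`). In the sketch, element `i` is placed on source `i*gap`, the unused sources stay dark and the unused matrix rows are made fully opaque, so leaked light is blocked. All names are illustrative assumptions.

```python
def vmm(x, m):
    """Ideal product y[j] = sum_i x[i] * m[i][j]."""
    return [sum(xi * row[j] for xi, row in zip(x, m)) for j in range(len(m[0]))]


def with_crosstalk(x, eps):
    """Assumed model: each source also lights its neighbouring rows with fraction eps."""
    n = len(x)
    return [x[i] + eps * ((x[i - 1] if i > 0 else 0.0) +
                          (x[i + 1] if i < n - 1 else 0.0)) for i in range(n)]


def spaced_layout(x, m, gap):
    """Place vector element i on source i*gap and matrix row i on row i*gap.

    Unused sources stay dark and unused matrix rows are fully opaque (0)."""
    n_rows = (len(x) - 1) * gap + 1
    xs = [0.0] * n_rows
    ms = [[0.0] * len(m[0]) for _ in range(n_rows)]
    for i, (value, row) in enumerate(zip(x, m)):
        xs[i * gap] = value
        ms[i * gap] = list(row)
    return xs, ms


x = [1.0, 0.5, 0.8]
m = [[0.2, 0.7], [0.9, 0.1], [0.4, 0.6]]
eps = 0.1
dense = vmm(with_crosstalk(x, eps), m)       # neighbours leak into each other
xs, ms = spaced_layout(x, m, gap=2)
sparse = vmm(with_crosstalk(xs, eps), ms)    # leaked light hits opaque rows
```

The price of this rearrangement is that `2n - 1` sources and matrix rows are used for an `n`-element vector, so it suits cases where the input vector is smaller than the core.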

In an exemplary embodiment of the invention, rearranging comprises rearranging said matrix.

In an exemplary embodiment of the invention, rearranging comprises rearranging said input vector so that at least some light sources of said VMM can be extinguished.

There is also provided in accordance with an exemplary embodiment of the invention, a method of improving signal detection in an electro-optical VMM, comprising:

receiving an input vector and a matrix to be processed by said VMM; and

adapting values of at least one of said input vector on an input of said VMM and said matrix in a matrix portion of said VMM, in a manner that improves signal detection. Optionally, adapting comprises negating values of at least some vector elements. Alternatively or additionally, adapting comprises shifting a baseline value to be non-zero, such that light sources of the VMM are not extinguished to achieve this baseline value. Alternatively or additionally, adapting comprises amplifying or reducing input values to make use of an available dynamic range of said VMM. Alternatively or additionally, adapting comprises shifting an input value baseline to make use of an available dynamic range of said VMM. Alternatively or additionally, adapting comprises applying a linearity correction. Alternatively or additionally, adapting comprises weighting vector elements with weights that correspond to a number of zero values in a corresponding matrix column. Alternatively or additionally, adapting comprises weighting at least one of vector elements and matrix elements with weights that correspond to an average of values in corresponding matrix columns.
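The baseline shift and the amplification or reduction of input values can be combined into one affine map, which a post-processor can undo exactly because the product is linear: if x' = a·x + b, then x'M = a(xM) + b(1M), where 1M is the vector of column sums of M. The sketch below assumes that M's column sums are known (for example, precomputed by a VPU when the matrix is loaded); the function names are hypothetical.

```python
def column_sums(m):
    """The product of an all-ones vector with m, i.e. 1M."""
    return [sum(row[j] for row in m) for j in range(len(m[0]))]


def vmm(x, m):
    """Ideal product y[j] = sum_i x[i] * m[i][j]."""
    return [sum(xi * row[j] for xi, row in zip(x, m)) for j in range(len(m[0]))]


def adapt_range(x, lo, hi):
    """Affine map x' = a*x + b sending [min(x), max(x)] onto [lo, hi]."""
    x_min, x_max = min(x), max(x)
    a = (hi - lo) / (x_max - x_min) if x_max > x_min else 1.0
    b = lo - a * x_min
    return [a * v + b for v in x], a, b


def undo_adaptation(y_adapted, m, a, b):
    """Since x'M = a(xM) + b(1M), recover xM = (x'M - b * colsum(M)) / a."""
    return [(y - b * s) / a for y, s in zip(y_adapted, column_sums(m))]


x = [-3.0, 0.0, 2.0, 5.0]          # signed values a light source cannot emit
m = [[0.1, 0.5], [0.3, 0.2], [0.6, 0.9], [0.2, 0.4]]
shifted, a, b = adapt_range(x, lo=0.1, hi=1.0)   # dark level avoided
recovered = undo_adaptation(vmm(shifted, m), m, a, b)
# recovered matches vmm(x, m) up to floating-point rounding
```

Choosing `lo` above zero keeps every source lit, and stretching to `hi` uses the full available range; both choices are illustrative.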

There is also provided in accordance with an exemplary embodiment of the invention, a method of improving signal detection in an electro-optical VMM, comprising:

receiving an input vector and a matrix to be processed by said VMM;

processing said vector by said VMM to produce an output vector; and

adapting values of said output vector, by applying a history correction which corrects for residual effects of a previous computation performed by said VMM. Optionally, said adapting comprises applying a temperature correction. Optionally, said adapting comprises applying a correction for an adaptation made to said input vector.
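A history correction can be sketched under a simple hypothetical residual model in which each reading retains a fraction `alpha` of the previous true result (for example, from detector persistence); in practice `alpha` would come from calibration. The correction then subtracts the residual recursively, using the already corrected previous output.

```python
def add_history(true_outputs, alpha):
    """Assumed artifact: each reading retains a fraction alpha of the previous result."""
    measured, previous = [], None
    for y in true_outputs:
        if previous is None:
            measured.append(list(y))
        else:
            measured.append([v + alpha * p for v, p in zip(y, previous)])
        previous = y
    return measured


def history_correct(measured, alpha):
    """Invert the model recursively: y_k = m_k - alpha * y_(k-1)."""
    corrected, previous = [], None
    for m in measured:
        if previous is None:
            y = list(m)
        else:
            y = [v - alpha * p for v, p in zip(m, previous)]
        corrected.append(y)
        previous = y
    return corrected


truth = [[1.0, 0.0], [0.2, 0.9], [0.7, 0.4]]
readings = add_history(truth, alpha=0.05)
restored = history_correct(readings, alpha=0.05)   # ≈ truth
```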

There is also provided in accordance with an exemplary embodiment of the invention, a method of determining a user path in a cellular system, comprising:

transmitting a different secondary spreading code on different lobes of an antenna; and

determining a user direction based on the secondary spreading code actually adopted by the user.
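Determining the direction then reduces to finding which lobe's code best matches the returned sequence, again a single vector-matrix product. The sketch assumes, as an illustration only, that the secondary codes are mutually orthogonal Walsh codes (rows of a Sylvester-type Hadamard matrix); the document does not specify the code family, and the function names are hypothetical.

```python
def walsh_codes(n):
    """Rows of a Sylvester-type Hadamard matrix: n mutually orthogonal ±1 codes."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def strongest_lobe(returned, codes):
    """Correlate the returned sequence with each lobe's code (one VMM pass)."""
    scores = [abs(sum(r * c for r, c in zip(returned, code))) for code in codes]
    return scores.index(max(scores))


codes = walsh_codes(4)                           # one secondary code per antenna lobe
returned = [0.8 * c + 0.1 for c in codes[2]]     # user adopted lobe 2's code, plus an offset
print(strongest_lobe(returned, codes))           # 2
```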

There is also provided in accordance with an exemplary embodiment of the invention, a method of assisting detection of a CDMA signal, comprising:

estimating a signal of an interfering path from an input vector; and

subtracting parts of said estimated path signal from said input vector, while not fully subtracting parts of said estimated signal that have a low confidence. Optionally, the method comprises subtracting a fractional value for signal parts, responsive to the confidence in the estimation of the parts.

There is also provided in accordance with an exemplary embodiment of the invention, a method of detection of a CDMA path signal, comprising:

providing an input vector;

generating a composite estimation signal including contributions from at least two consecutive bits; and

correlating the composite signal with an input vector. Optionally, said correlating comprises correlating with a portion of said input vector that is large enough to contain two bits. Alternatively or additionally, the method comprises joining the contributions of correlations on successive input vector portions.

There is also provided in accordance with an exemplary embodiment of the invention, a method of MUD (Multi User Detection), comprising:

estimating the signals of a plurality of paths in parallel; and

subtracting said signals together from an input vector.

There is also provided in accordance with an exemplary embodiment of the invention, a method of CDMA signal detection, comprising:

receiving an input signal as an input vector; and

at least one of detecting and decoding said signal using a vector matrix multiplier for at least one of processing multiple path signals in parallel and multiple parallel correlations. Optionally, the method comprises multi-user detection using a parallel path estimation. Alternatively or additionally, the method comprises multi-user detection using a decorrelation detection. Alternatively or additionally, the method comprises applying a smart antenna algorithm by detecting and subtracting out contributions from strong interfering signals. Alternatively or additionally, the method comprises arranging paths in groups for separate processing for each spreading factor.
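One way a VMM can perform multiple correlations in parallel is to load the matrix rows with copies of a spreading code placed at different chip offsets, so that a single multiplication of an input window yields the correlation at every offset. The sketch below assumes signed code values (±1) and ideal arithmetic; an optical implementation would additionally need the sign handling (negation or baseline shifting) discussed earlier. The names are illustrative.

```python
def correlation_matrix(code, offsets, window):
    """Row k holds `code` starting at chip offsets[k] inside a window of `window` chips."""
    rows = []
    for off in offsets:
        row = [0] * window
        for i, c in enumerate(code):
            if 0 <= off + i < window:  # truncate code copies that overhang the window
                row[off + i] = c
        rows.append(row)
    return rows


def vmm(matrix, x):
    """One VMM pass: output k is the correlation of x with matrix row k."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


code = [1, -1, 1, 1]
received = [0, 0] + code + [0, 0]  # code arrives with a delay of 2 chips
corr = vmm(correlation_matrix(code, range(5), 8), received)
print(corr)  # [0, -1, 4, -1, 0]; the peak marks the path delay
```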

There is also provided in accordance with an exemplary embodiment of the invention, a VMM component, comprising:

an array of source elements;

at least one optical component which spreads the light from one source into a beam;

an SLM comprising a plurality of elements arranged in logical rows arranged to match said beam; and

an array of detector elements which detect the contributions of modulation of said SLM for multiple beams,

wherein at least one of said source elements, said detector elements and said SLM elements are non-uniformly sized. Optionally, said non-uniformity is selected to compensate for a non-uniformity of light intensity distribution of said component. Alternatively or additionally, at least one of the edge components is made larger. Alternatively or additionally, the effective optical size of non-edge components is made smaller by intentional degradation.

There is also provided in accordance with an exemplary embodiment of the invention, a method of processing using and calibrating a VMM component, comprising:

loading said VMM with an input data vector and a matrix;

reserving at least one of an input data vector element, an output detector element and a matrix element for calibration;

processing said input vector using said VMM and said matrix to produce a result vector;

determining at least one calibration value based on said at least one reserved element; and

correcting said result vector based on said calibration value. Optionally, said at least one reserved element comprises a plurality of detectors used to detect a sum value. Optionally, said at least one reserved element comprises at least one matrix row and at least one input vector element.
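The following sketch illustrates the reserved-sum-detector option, under the simplifying assumption that the dominant error is a single unknown gain shared by all detectors. One extra matrix row is set to all ones, so its ideal output equals the sum of the input elements, which is known digitally; the ratio of the measured to the expected sum gives the calibration value used to correct the other outputs. The function names and the one-parameter error model are assumptions.

```python
def vmm_with_sum_row(matrix, x, gain):
    """Simulated VMM with an unknown common gain; the last row is reserved (all ones)."""
    rows = matrix + [[1.0] * len(x)]
    return [gain * sum(m * v for m, v in zip(row, x)) for row in rows]


def calibrate(result, x):
    """Estimate the common gain from the reserved sum detector and correct the data rows."""
    *data, measured_sum = result
    expected_sum = sum(x)  # computed digitally; assumed non-zero
    g = measured_sum / expected_sum
    return [y / g for y in data], g


M = [[0.5, 1.0], [0.25, 0.0]]
x = [2.0, 4.0]
corrected, g = calibrate(vmm_with_sum_row(M, x, gain=0.9), x)
print(g, corrected)  # approximately 0.9 and [5.0, 0.5]
```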

BRIEF DESCRIPTION OF THE FIGURES

Non-limiting exemplary embodiments of the invention will be described in the following description of exemplary embodiments, read in conjunction with the accompanying figures. Identical structures, elements or parts that appear in more than one of the figures are labeled with the same or a similar numeral in all the figures in which they appear.

FIG. 1 is a schematic block diagram of a vector matrix multiplier (VMM) signal processing engine (SPE), in accordance with an exemplary embodiment of the invention;

FIG. 2A is a schematic block diagram of an exemplary electro-optical VMM core, in accordance with an exemplary embodiment of the invention;

FIG. 2B is a schematic illustration of an optical portion of an electro-optical VMM core, in accordance with an exemplary embodiment of the present invention;

FIG. 3A is a schematic side view of an optical implementation of a VMM core, in accordance with an exemplary embodiment of the invention;

FIG. 3B is a schematic three-dimensional view of the optical implementation of FIG. 3A, in accordance with an exemplary embodiment of the invention;

FIG. 4 is a schematic side view of an optical implementation of a VMM core, in accordance with another exemplary embodiment of the invention;

FIG. 5A is a schematic side view of an optical implementation of a VMM core, in accordance with still another exemplary embodiment of the invention;

FIG. 5B is a schematic side view of an optical implementation of a VMM core, in accordance with still another exemplary embodiment of the invention;

FIG. 6A is a schematic side view of an optical implementation of a VMM core, in accordance with still another exemplary embodiment of the invention;

FIG. 6B is a schematic illustration of the organization of an array of VCSELs, in accordance with the VMM core of FIG. 6A;

FIG. 6C is a schematic illustration of a reflective matrix suitable for use in accordance with the VMM core of FIG. 6A;

FIG. 6D is a schematic illustration of a detector array suitable for use in the VMM core of FIG. 6A;

FIG. 6E is a schematic illustration of a reflective matrix suitable for use with the VMM core of FIG. 6A;

FIG. 7 is a schematic block illustration of a UMTS system suitable for implementation using a module as described in FIG. 1;

FIG. 8 is a schematic diagram of an information transmission and reception process, in accordance with an exemplary embodiment of the invention, and based on the UMTS standard;

FIG. 9 is a schematic block diagram of a rake receiver, in accordance with an exemplary embodiment of the invention and suitable, for example, for use in FIG. 7;

FIG. 10 is a flow diagram for a rake receiver, in accordance with an exemplary embodiment of the invention, in which hexagons represent partial or complete matrix updates;

FIG. 11 is a schematic flow diagram of a rake receiver including MUD (multi-user detection), in accordance with an exemplary embodiment of the invention; and

FIG. 12 is a schematic data flow diagram for a rake receiver implementation that includes a smart uplink and/or downlink antenna and MUD, in accordance with an exemplary embodiment of the invention.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

SPE Overview

FIG. 1 is a schematic block diagram of a vector matrix multiplier (VMM) based signal processing engine (SPE) 102, in accordance with an exemplary embodiment of the invention. SPE 102 includes a VMM sub-system 199 adapted to perform vector matrix multiplication at a relatively high speed. Sub-system 199 includes a VMM core 204 which is adapted to perform multiplication of input vectors by a matrix at a relatively high rate. Optionally, VMM core 204 includes a programmable matrix, such that the matrix multiplying the input vectors may change according to the task performed and/or during the performance of a task (i.e., according to the steps of the task). In addition to VMM core 204, VMM sub-system 199 includes additional elements, as is now described, which are used to translate between mathematical values and values processed internally by VMM core 204.

In some embodiments of the invention, for example as described below with reference to FIG. 2A, matrix data of VMM core 204 is preprocessed by a matrix pre-processor (MPP) 225. A matrix memory 220 optionally stores values of a plurality of matrices and/or matrix portions which may be used for fast replacement of the matrix in VMM core 204. The vectors multiplied in VMM core 204 are optionally passed through a pre-processor 302 on their way to the VMM core. Similarly, the resulting vectors are optionally passed through a post-processor 314. The operation of pre-processor 302 and post-processor 314 is described in detail below, with reference to FIG. 2A.

In some embodiments of the invention, SPE 102 further includes a vector processing unit (VPU) 206 for performing vector-vector operations and/or vector-scalar operations, as described in detail below. A parameter extractor unit 242 optionally determines maximum, minimum and/or other parameters of single vectors and/or vector streams, as described below.

In some embodiments of the invention, vector data is received by SPE 102 on an input line 205, through an optional high speed input port (HSIP) 211. Processed vector data is optionally provided on an output line 209, through an optional high speed output port (HSOP) 221. In some embodiments of the invention, a vector buffer 202 is used to regulate the input and/or output of vector data, so as to allow transferring the vector data at a rate different from the operation rate of SPE 102. Vector buffer 202 optionally has a capacity sufficient to store thousands of vectors, for example 4000 vectors of 256 elements.

Alternatively to having a single buffer 202 for input and output, separate buffers are provided for input and output. Further alternatively or additionally, the input and/or the output are unbuffered.

VMM core 204, VPU 206, parameter extractor 242 and vector buffer 202 optionally exchange vectors between them through an optional vector register file 213. Register file 213 optionally includes dedicated registers for input and output of each element communicating through the register file. In one exemplary implementation, vectors input from buffer 202 are optionally placed automatically in a “buffer in” register 251, while vectors to be output to buffer 202 are placed in a “buffer out” register 252, for retrieval. In some embodiments of the invention, register file 213 optionally includes one or more general purpose registers 258 which are used to store intermediate vectors during their processing. One or more shift registers 260 are optionally used to shift the elements of a vector internally and/or between two vectors, as described below.

Alternatively or additionally to using vector register file 213, the units of SPE 102 may communicate using any other method known in the art, such as a multiple access bus and/or dedicated point to point buses.

The units of SPE 102 other than a DSP 214 (e.g., VMM sub-system 199, VPU 206, buffer 202, register file 213, HSIP 211, HSOP 221, parameter extractor 242), jointly referred to as an APL 118, are optionally controlled by a controller 210. Controller 210 optionally provides control signals to the units on control lines 236, which for simplicity are only shown near controller 210. Optionally, controller 210 receives commands which state, for each unit of SPE 102, the tasks that the unit is to perform in the current cycle. A controller memory 216 optionally stores command sequences to be carried out by controller 210.

In some embodiments of the invention, the commands of controller 210 include a field for each of the controllable units of SPE 102. For example, the commands may include six fields, for buffer 202, VMM sub-system 199, register file 213, parameter extractor 242, VPU 206 and APL controller 210. All the instructions in the fields of a single command are optionally performed in a single clock cycle. The field for APL controller 210 optionally includes program flow control commands, such as branch and looping commands. Optionally, when a unit is not used in a certain cycle, its field includes a no operation command. In some embodiments of the invention, fields of at least some of the units may include parameter setting commands, such as commands for replacing the matrix elements of VMM core 204 and/or changing the operation mode of the VMM core. The field of register file 213 optionally may include register transfer commands (i.e., commands for transferring vectors between registers), register zeroing commands and/or shift commands for the shift registers.
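A command word with one field per unit can be modeled as a record in which unused fields default to a no-operation value. The sketch below is illustrative only: the field names mirror the six units listed above, while the opcode strings are hypothetical placeholders, not an actual instruction set.

```python
from dataclasses import dataclass, fields

NOP = "nop"


@dataclass
class Command:
    """One controller command: a field per unit, all executed in the same clock cycle."""
    buffer: str = NOP
    vmm: str = NOP
    register_file: str = NOP
    parameter_extractor: str = NOP
    vpu: str = NOP
    controller: str = NOP  # program flow, e.g. a branch or loop command

    def active_units(self):
        """Units that do something other than a no-operation this cycle."""
        return [f.name for f in fields(self) if getattr(self, f.name) != NOP]


cmd = Command(vmm="multiply", vpu="add r1 r2", register_file="shift r3 1")
print(cmd.active_units())  # ['vmm', 'register_file', 'vpu']
```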

In an exemplary embodiment of the invention, the command field of VPU 206 indicates a precision to be used by the VPU, a method of overflow treatment and/or a parameter of truncation, such as whether to use a floor or round function and/or the number of bits to be truncated.

In some embodiments of the invention, the commands carried out by controller 210 and/or stored in controller memory 216 are generated by DSP 214 (or a host computer), optionally by compiling high-level language directives. Optionally, DSP 214 may activate specific sub-routines and/or procedures stored in memory 216. In addition, DSP 214 may optionally halt the operation of controller 210 and/or force a flow branch, for example for debugging. In some embodiments of the invention, DSP 214 may provide controller 210 with operational parameters (e.g., correction coefficients) for the units of SPE 102, such as pre-processor 302, post-processor 314 and/or MPP 225.

In some embodiments of the invention, SPE 102 uses digital signal processor (DSP) 214 for performing scalar operations. For example, DSP 214 may be used for complex decision algorithms and/or for floating point operations. Alternatively or additionally, DSP 214 may perform vector operations which cannot be, or are chosen not to be, performed by VPU 206 and/or VMM core 204. Generally, DSP 214 performs such operations separately on each element of the processed vector, optionally sequentially. Optionally, DSP 214 is associated with a DSP memory 229 for storing intermediate data and/or scalar results. Alternatively or additionally, memory 229 is used to store subroutines and/or other instructions to be executed by DSP 214. DSP memory 229 may include substantially any suitable memory type, for example RAM, ROM and/or a combination thereof. Alternatively or additionally to DSP 214, SPE 102 may include a general purpose processor and/or a dedicated ASIC for some scalar tasks.

In some embodiments of the invention, DSP 214 controls the operation of SPE 102, for example based on instructions from an external host and/or according to pre-stored programs in DSP memory 229. Optionally, DSP 214 instructs controller 210 on the tasks it is to perform, for example by stating library subroutines of commands that are to be performed. DSP 214 optionally communicates with controller 210 over a bus 240, which additionally allows DSP 214 and/or controller 210 to communicate with other units of SPE 102. Alternatively, DSP 214 communicates with the other units only through controller 210 and/or register file 213.

In some embodiments of the invention, a host interface port 212 connects DSP 214 to an external computer host, optionally through a serial or parallel (e.g., PCI, VME) device line. In an exemplary embodiment of the invention, host interface port 212 provides matrix data and/or operation instructions. In an exemplary embodiment of the invention, host interface port 212 comprises a 32 bit 33 MHz slave-only PCI interface. Alternatively or additionally, other interface ports may be used, for example, a serial connection such as one using RTS/RTR (request to send/restart request) lines, a pull configuration, in which SPE 102 requests data, and/or a push configuration in which data is pushed to SPE 102. Various interrupt schemes (from/to SPE 102) are optionally implemented. In some embodiments of the invention, DSP 214 is connected to a debugging interface (not shown), such as a JTAG interface.

Alternatively to including a separate DSP 214 and controller 210, in some embodiments of the invention, a single processor is used both for control and scalar processing. These embodiments may be used, for example, when the extent of scalar processing is expected to be relatively small. Further alternatively or additionally, DSP 214 is used instead of VPU 206, for example when the extent of vector operations (which do not include matrix multiplication) is relatively small. Further alternatively or additionally, two or more of memories 216, 220 and 229 are combined. Alternatively or additionally, each memory unit comprises a suitable unit (RAM, ROM, etc.) according to its specific task.

In some embodiments of the invention, SPE 102 is used primarily for matrix multiplication tasks. Parameter extractor 242 may be used, for example, to find a maximum of the resultant product vectors and/or to find a maximum of the input vectors. In other embodiments of the invention, more complex processing schemes may be performed by SPE 102. For example, VPU 206 may be applied to some or all of the input vectors and/or product vectors. Some vectors handled by SPE 102 may not be passed to VMM core 204 at all, or may be passed through VMM core 204 a plurality of times. In an exemplary alternative embodiment, VMM core 204 operates as a mathematical co-processor of VPU 206 and/or the other elements of SPE 102. In some cases, an algorithm or a step in an algorithm performed by SPE 102 will only require processing by VPU 206 and/or DSP 214.

Alternatively or additionally to pre-processor 302 and/or post-processor 314, VPU 206 may be used to perform pre-processing and/or post-processing tasks.

In some embodiments of the invention, SPE 102 is produced in a small size which fits into industry standard electronic cards and/or racks. In an exemplary implementation, SPE 102 is implemented by a 10×8 cm chip with a thickness of 1.7 cm. VMM sub-system 199 is optionally positioned in the center. The I/O ports 211 and 221 are optionally located on the two opposite long sides, a DSP along a third side and the other components along the fourth side.

In some embodiments of the invention, vector elements have a predetermined number of bits, for example 8 bits, which are used to state an integer value. Optionally, each vector may be accompanied by an exponent value, for example of 2 bits, which states a multiplication factor for all the elements of the vector. Optionally, a default multiplication factor of 1 is used.
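The per-vector exponent acts as a shared (block) scale factor. The text does not state the base of the multiplication factor, so the sketch below assumes a factor of 2**e, signed 8-bit mantissas and a 2-bit exponent; `encode_block` and `decode_block` are hypothetical names.

```python
def encode_block(values, mant_bits=8, exp_bits=2):
    """Return (mantissas, e) with the smallest shared e such that values ~ m * 2**e."""
    limit = 2 ** (mant_bits - 1) - 1  # 127 for signed 8-bit mantissas
    for e in range(2 ** exp_bits):    # e = 0 gives the default factor of 1
        mantissas = [round(v / 2 ** e) for v in values]
        if all(abs(m) <= limit for m in mantissas):
            return mantissas, e
    raise OverflowError("values exceed the range of the block format")


def decode_block(mantissas, e):
    return [m * 2 ** e for m in mantissas]


m, e = encode_block([100, -300, 52])
print(m, e, decode_block(m, e))  # [25, -75, 13] 2 [100, -300, 52]
```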

Complex Vectors

In some embodiments of the invention, complex vectors are represented in SPE 102 by a pair of vectors: a real vector which stores the real parts of the complex vector elements and an imaginary vector which stores the imaginary parts of the complex elements. The vector matrix multiplication is optionally performed by multiplying each of the vectors by a real matrix including the real parts of the matrix values and by an imaginary matrix including the imaginary parts of the matrix values. The resultant complex vector is then calculated from the partial results as is known in the art.

Alternatively or additionally, complex vectors are represented in SPE 102 by a single vector in which the real and imaginary parts are stored in alternate elements (e.g., the real parts of complex elements 1, 2, 3, . . . are stored in positions 1, 3, 5, . . . and the imaginary parts are stored in positions 2, 4, 6, . . . ). In accordance with this alternative, when the matrix of VMM core 204 is used for complex multiplications its elements have the form:

$$M(i,j) = \begin{cases} \operatorname{Re}\{C[(i+1)/2,\,(j+1)/2]\}, & i,\,j \text{ odd} \\ \operatorname{Im}\{C[(i+1)/2,\,j/2]\}, & i \text{ odd},\ j \text{ even} \\ -\operatorname{Im}\{C[i/2,\,(j+1)/2]\}, & i \text{ even},\ j \text{ odd} \\ \operatorname{Re}\{C[i/2,\,j/2]\}, & i,\,j \text{ even} \end{cases}$$

where C[i,j] is the represented complex matrix used in the multiplication.

Alternatively or additionally, other complex vector representation methods are used for the vectors and/or the matrix. In some embodiments of the invention, VPU 206, pre-processor 302 and/or post-processor 314 are adapted to convert vectors between formats and/or to correct vectors resulting from multiplication by a complex matrix into a proper format.
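The sketch below checks both complex representations numerically. The sign pattern of the M(i,j) cases given above reproduces complex multiplication when the interleaved vector multiplies the matrix from the left, that is, output element j equals the sum over i of v(i)·M(i,j); this convention is inferred here from the signs and is an assumption of the sketch, as are the function names. Indices are 0-based in the code, so code index 2p corresponds to odd position 2p+1 in the text.

```python
def to_interleaved(re, im):
    """{Re(1)..Re(N)}, {Im(1)..Im(N)} -> {Re(1), Im(1), ..., Re(N), Im(N)}."""
    out = []
    for a, b in zip(re, im):
        out += [a, b]
    return out


def interleaved_matrix(C):
    """Real 2n x 2m matrix M built from an n x m complex matrix C per the M(i,j) cases."""
    n, m = len(C), len(C[0])
    M = [[0.0] * (2 * m) for _ in range(2 * n)]
    for p in range(n):
        for q in range(m):
            c = complex(C[p][q])
            M[2 * p][2 * q] = c.real           # i, j odd (1-based)
            M[2 * p][2 * q + 1] = c.imag       # i odd, j even
            M[2 * p + 1][2 * q] = -c.imag      # i even, j odd
            M[2 * p + 1][2 * q + 1] = c.real   # i, j even
    return M


def row_times_matrix(v, M):
    """Output element j = sum_i v[i] * M[i][j] (assumed convention)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def split_multiply(re, im, C):
    """Pair representation: four real multiplications, then combine the partial results."""
    Cr = [[complex(c).real for c in row] for row in C]
    Ci = [[complex(c).imag for c in row] for row in C]
    rr, ii = row_times_matrix(re, Cr), row_times_matrix(im, Ci)
    ri, ir = row_times_matrix(re, Ci), row_times_matrix(im, Cr)
    return [a - b for a, b in zip(rr, ii)], [a + b for a, b in zip(ri, ir)]


C = [[1 + 2j, 3 - 1j], [0.5j, -2]]
x = [1 - 1j, 2 + 3j]
re, im = [z.real for z in x], [z.imag for z in x]
y = row_times_matrix(to_interleaved(re, im), interleaved_matrix(C))
expected = [sum(x[p] * C[p][q] for p in range(2)) for q in range(2)]
print(y, expected)  # y holds the real and imaginary parts of expected, interleaved
```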

Shift Register

Referring in more detail to shift registers 260, in some embodiments of the invention, shift registers 260 are adapted to perform cyclic and/or non-cyclic shifting of the elements of the vector they store. Optionally, the shift may be performed either up (moving the value of the last vector element to a lower index position) or down (moving the value of the first vector element to a higher index position). In non-cyclic shifting, vector positions that become vacant due to the shift are optionally filled with a predetermined value, e.g., zero, as is known in the art. In some embodiments of the invention, the shift may be performed by any number of positions, the number optionally being stated in the shift command. Alternatively, the shift is performed by a predetermined number of positions, e.g., one position.
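A minimal software model of the shift modes described above, with illustrative parameter names. Shifting two registers together as one extra-long vector amounts to concatenating them in register order, shifting, and splitting the result back.

```python
def shift(v, n=1, direction="up", cyclic=False, fill=0):
    """Shift a register's contents by n positions.

    'up' moves values toward lower indices; 'down' moves them toward higher indices.
    In a non-cyclic shift, vacated positions receive `fill`.
    """
    size = len(v)
    if size == 0:
        return []
    if cyclic:
        k = n % size
        return v[k:] + v[:k] if direction == "up" else v[size - k:] + v[:size - k]
    n = min(n, size)
    if direction == "up":
        return v[n:] + [fill] * n
    return [fill] * n + v[:size - n]


# Two registers shifted together as one extra-long vector, in register order.
r1, r2 = [1, 2], [3, 4]
joined = shift(r1 + r2, 1, "down")
r1, r2 = joined[:2], joined[2:]
print(r1, r2)  # [0, 1] [2, 3]
```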

In some embodiments of the invention, a shift may be performed in which the vectors in two or more shift registers 260 are shifted together. Optionally, the two or more vectors are treated as a single extra-long vector, according to a predetermined register order. Alternatively or additionally, a shift is performed together on two vectors which are viewed as a complex vector pair. In some embodiments of the invention, shift registers 260 are adapted to perform translation between different complex vector formats, for example:

{Re(1), Re(2), . . . , Re(N)}, {Im(1), Im(2), . . . , Im(N)} => {Re(1), Im(1), Re(2), Im(2), . . . , Re(N), Im(N)}

and/or to perform a complex conjugate operation:

{Re(1), Im(1), Re(2), Im(2), . . . } => {−Im(1), Re(1), −Im(2), Re(2), . . . }

Input/Output Interface

In some embodiments of the invention, the input and/or output ports 211 and 221 comprise parallel units, allowing high rate data transfer without very tight timing constraints. Alternatively or additionally, input and/or output ports 211 and 221 comprise serial ports. Further alternatively or additionally, the received data comprises serial data that is converted into parallel data by buffer 202 and/or by an external component (not shown). Further alternatively or additionally, any other input and/or output methods may be used, including methods which involve receiving one or more control, format and/or timing lines along with the data. Optionally, input and/or output ports 211 and 221 may receive data in accordance with a plurality of different methods.

In an exemplary embodiment of the invention, ports 211 and/or 221 receive vectors of 256 elements, each element having one byte (i.e., 8 bits). Each element is optionally provided on a separate serial line, driven at 1 GHz, providing vectors at a rate of 125 MHz. In an exemplary embodiment of the invention, the serial lines are driven using differential line driving, to overcome signal noise and interference problems. In an exemplary embodiment of the invention, the input (and/or the output) uses LVDS (Low Voltage Differential Signaling) buffers and/or interfaces with two wires for each provided value. Other signal bussing methods may be used as well. It is noted that the data may be provided MSB first and/or LSB first.
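The quoted rates follow from simple arithmetic, assuming one bit is transferred per clock on each serial line:

$$\frac{1\ \text{Gbit/s per line}}{8\ \text{bits per vector element}} = 125\times10^{6}\ \text{vectors/s}$$

$$256\ \text{lines}\times 1\ \text{Gbit/s} = 256\ \text{Gbit/s} = 32\ \text{GB/s per port}$$

With LVDS using two wires per value, a 256-element port would need 512 signal wires.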

It should be appreciated that in some applications SPE 102 performs several operations on input vectors before outputting the result, such that the data transfer rates of input and output ports 211 and 221 may be slower than the operation rate of VMM core 204.

It is noted that the use of 256 element vectors is brought by way of example, and any other vector sizes may be used, according to their utility in industry.

Vector Processing Unit

In some embodiments of the invention, VPU (vector processing unit) 206 is adapted to perform element-by-element operations on one, two or more vectors. Optionally, the element-by-element vector operations provided by VPU 206 include XOR, OR, AND, multiply, average, subtract and/or add. Alternatively or additionally, VPU 206 is adapted to perform a rotate operation on a pair of vectors (equivalent to multiplying by j=sqrt(−1)), e.g., v₁<=v₂, v₂<=−v₁. In some embodiments of the invention, VPU 206 is adapted to perform vector multiplication, resulting in a scalar.

In some embodiments of the invention, VPU 206 is adapted to perform single vector operations, such as the logical NOT operation, absolute value, negation, multiplication by 2^n, truncation and/or rounding. Optionally, VPU 206 is adapted to perform inter-vector conjugate operations, such as V[2n]=>V[2n−1], −V[2n−1]=>V[2n] and/or V[2n]+V[2n−1]=>V[2n−1], 0=>V[2n]. The specific conjugate operations used are optionally determined according to the specific vector representations used for complex vectors. In some embodiments of the invention, VPU 206 is adapted to perform vector format changing operations, which are used to change the format of a vector as required for further processing.

Alternatively or additionally, VPU 206 is adapted to perform vector-scalar operations, such as adding, subtracting and/or multiplying all the elements of a vector by a single scalar. In some embodiments of the invention, such vector-scalar operations are performed by setting all the elements of a temporary vector to the scalar value and performing an element-by-element operation.
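The element-wise, vector-scalar and rotate operations can be sketched in a few lines. This is an illustrative software model, not the hardware design: the scalar is broadcast into a temporary vector as described above, and the rotate operation follows the stated assignment v₁<=v₂, v₂<=−v₁. That assignment multiplies by j when v₁ holds the imaginary parts and v₂ the real parts; with the roles swapped it multiplies by −j.

```python
import operator


def elementwise(op, a, b):
    """Apply a binary operation element by element (e.g. operator.add, operator.xor)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [op(x, y) for x, y in zip(a, b)]


def vector_scalar(op, v, s):
    """Broadcast the scalar into a temporary vector, then reuse the element-wise path."""
    return elementwise(op, v, [s] * len(v))


def rotate(v1, v2):
    """Pair rotation v1 <= v2, v2 <= -v1."""
    return list(v2), [-x for x in v1]


print(elementwise(operator.xor, [0b1100, 0b1010], [0b1010, 0b1010]))  # [6, 0]
print(vector_scalar(operator.mul, [1, 2, 3], 2))                      # [2, 4, 6]
new_im, new_re = rotate([2.0], [1.0])  # v1 = Im, v2 = Re of 1 + 2j
print(complex(new_re[0], new_im[0]))   # (-2+1j), i.e. j * (1 + 2j)
```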

Further alternatively or additionally, VPU 206 is adapted to perform intra-vector processing operations, such as rearranging (shuffling) of vector elements and/or adding elements of a single vector to each other (for example, adding each two adjacent elements and replacing one by the sum and the other by zero). Additional intra-vector operations may include selection of a portion of a vector and/or rearrangement of a vector. Optionally, these additional operations are performed using a second vector which states the selected elements and/or the rearrangement order of the elements. Alternatively, these additional operations are performed according to pre-configured parameters.
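Two of the intra-vector operations above, sketched with hypothetical helper names: adjacent-pair summation, and selection or rearrangement driven by a second vector of indices.

```python
def pairwise_sum(v):
    """Add each two adjacent elements: the sum replaces the first, zero replaces the second."""
    out = list(v)
    for i in range(0, len(v) - 1, 2):
        out[i], out[i + 1] = v[i] + v[i + 1], 0
    return out


def gather(v, order):
    """Select and/or rearrange elements as directed by a second vector of indices."""
    return [v[k] for k in order]


print(pairwise_sum([1, 2, 3, 4, 5]))        # [3, 0, 7, 0, 5]
print(gather([10, 20, 30, 40], [3, 0, 2]))  # [40, 10, 30]
```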

In some embodiments of the invention, VPU 206 is adapted to perform operations on three or more vectors, for example frequently used sequences of operations, which can be performed more efficiently as a single compound operation than as a sequence of separate operations. For example, a calibration operation (e.g., each element is multiplied by an element-specific calibration value and another element-specific value is subtracted) may be implemented. In some embodiments of the invention, VPU 206 is adapted to propagate a carry between operations on two or more vector elements, so that a pair (or greater number) of vector elements can act as a single, higher precision, vector element.
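A sketch of the calibration operation and of carry propagation between paired elements. It assumes unsigned 8-bit elements, with the low and high bytes of each 16-bit value held in two separate vectors; both assumptions are illustrative.

```python
BITS = 8
MASK = (1 << BITS) - 1


def calibrate(v, gain, offset):
    """Multiply each element by its own calibration value, then subtract its own offset."""
    return [x * g - o for x, g, o in zip(v, gain, offset)]


def add_with_carry(a_lo, a_hi, b_lo, b_hi):
    """Add 16-bit values stored as (low byte, high byte) element pairs in separate vectors."""
    lo, hi = [], []
    for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi):
        s = al + bl
        carry = s >> BITS                    # 1 if the low-byte sum overflowed
        lo.append(s & MASK)
        hi.append((ah + bh + carry) & MASK)  # a carry out of the high byte is dropped
    return lo, hi


print(calibrate([1, 2], [2, 3], [0.5, 1]))             # [1.5, 5]
print(add_with_carry([0xFF], [0x01], [0x01], [0x00]))  # ([0], [2]): 0x01FF + 1 = 0x0200
```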

Alternatively or additionally, VPU 206 is adapted to perform, in a single operation cycle, a sequence of mathematical and/or logical operations, for example negating the elements of a vector, adding a scalar to each element of the vector and taking the absolute value of each vector element.

It should be noted that VPU 206, depending on the function performed, may receive as input one or more vectors and scalar values and output a vector, a partial vector or a scalar. In some cases (e.g., peak detection) an output pair is provided (e.g., location and value of the peak). The data used for performing the processing of VPU 206 is optionally provided from vector register file 213. Alternatively or additionally, a separate or additional memory is used, for example for scalar values.

In some embodiments of the invention, VPU 206 is implemented by a field programmable gate array (FPGA), an application specific integrated circuit (ASIC) and/or any other dedicated hardware. In some embodiments of the invention, VPU 206 includes a plurality of different elements for different tasks. Alternatively or additionally, VPU 206 includes a plurality of identical units which may operate in parallel on different vectors. Further alternatively or additionally, some of the operations performed by VPU 206 utilize shared hardware. For example, the add and subtract operations may use the same hardware, with a negation operation performed for subtraction. Alternatively or additionally, some or all of VPU 206 is implemented in software on a digital signal processor and/or on a general purpose processor.

In an exemplary embodiment of the invention, VPU 206 and/or hardware components thereof are replaceable, for example when reconfiguring SPE 102 after manufacture. The replacement may be performed, for example, by inserting or replacing an application specific microcircuit.

In some embodiments of the invention, as described above, VPU 206 comprises digital circuits. Alternatively, VPU 206 includes at least one analog processing circuit. Optionally, in some embodiments, SPE 102 receives analog signals provided directly to VPU 206 (which may include an A/D converter).

Optionally, VPU 206 may operate at different precision levels, e.g., 8 bits, 10 bits or 16 bits. Alternatively, VPU 206 has the same precision as the input data to SPE 102, e.g., 8 bits. Further alternatively, VPU 206 has a higher precision than the input data to SPE 102, for example, 10 or 16 bits. Optionally, controller 210 sets the VPU precision to be used in the command field of VPU 206 provided for each operation.
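The precision controls mentioned for the VPU command field (truncation by floor or rounding, and overflow treatment) can be illustrated on integers as follows. Saturation and two's-complement wrap-around are shown as two plausible overflow treatments; the text does not specify which are used, and the names are hypothetical.

```python
def truncate(x, bits, mode="floor"):
    """Drop the `bits` least significant bits of integer x by flooring or rounding (half up)."""
    if bits == 0:
        return x
    if mode == "round":
        x += 1 << (bits - 1)  # add half an LSB of the result before flooring
    return x >> bits          # Python's >> floors, also for negative x


def fit(x, width, overflow="saturate"):
    """Force x into a signed `width`-bit range by saturating or by two's-complement wrap."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if overflow == "saturate":
        return max(lo, min(hi, x))
    return (x - lo) % (1 << width) + lo


print(truncate(7, 1), truncate(7, 1, "round"))  # 3 4
print(fit(200, 8), fit(200, 8, "wrap"))          # 127 -56
```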

It should be noted that VPU 206 may operate on data before, in parallel with and/or after data is processed by VMM core 204.

Parameter Extractor

In some embodiments of the invention, parameter extractor 242 is adapted to perform a peak detection operation, which detects one or more highest or lowest values of a vector and/or detects a local area of the vector including the peak value. Alternatively or additionally, a threshold operation detects values above or below a given threshold. Further alternatively or additionally, parameter extractor 242 is adapted to perform a maxima or minima operation adapted to find local maxima and/or minima points of the vector.

In some embodiments of the invention, in addition to single vector calculations, parameter extractor 242 is adapted to perform vector sequence operations, for example, finding a peak vector or vector element over time (e.g., a history of vectors). Optionally, parameter extractor 242 includes an internal vector memory in which a maximal value for each element is stored, together with an index number of the vector in the sequence achieving the maximum. For each vector of the sequence, parameter extractor 242 compares each of the elements of the vector to the corresponding element of the internal vector memory, and updates the internal memory if required. Alternatively, parameter extractor 242 may be used to find minimum vector elements in a sequence, to find first and/or last elements in a sequence passing a threshold and/or to count the number of vectors passing a threshold for each element. Further alternatively or additionally, parameter extractor 242 may determine a vector having a largest or smallest magnitude in a sequence of vectors. In some embodiments of the invention, parameter extractor 242 is incorporated into the VPU.
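A minimal model of the per-element running maximum kept in the parameter extractor's internal memory, together with single-vector peak detection returning a (location, value) pair. Class and function names are illustrative.

```python
class RunningPeak:
    """Per-element maximum over a sequence of vectors, with the index of the vector achieving it."""

    def __init__(self, size):
        self.value = [float("-inf")] * size
        self.index = [-1] * size
        self.count = 0

    def update(self, v):
        for k, x in enumerate(v):
            if x > self.value[k]:  # strict: ties keep the earliest vector
                self.value[k], self.index[k] = x, self.count
        self.count += 1


def peak(v):
    """Single-vector peak detection returning the (location, value) pair."""
    k = max(range(len(v)), key=v.__getitem__)
    return k, v[k]


rp = RunningPeak(3)
for vec in ([1, 5, 2], [3, 4, 2], [0, 6, 1]):
    rp.update(vec)
print(rp.value, rp.index)  # [3, 6, 2] [1, 2, 0]
print(peak([1, 9, 3]))     # (1, 9)
```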

Timing Issues

In some embodiments of the invention, input and/or output ports 211 and221 operate at a different clock rate than VMM core 204. Optionally, DSP214 and/or controller 210 have separate clock cycles from each otherand/or from VMM core 204 and/or input and output ports 211 and 221. Theuse of separate clock rates allows utilizing maximal processingresources of each unit, without slowing one of the units because ofothers and/or requiring expensive high rate units that are notnecessary. For example, as vectors are generally handled by SPE 102 forseveral cycles, the input and/or output rate may be slower than the rateof operation of VMM core 204. In addition, the use of separate clockrates for the different units allows exchanging or upgrading parts ofSPE 102, e.g. DSP 214, without redesigning the entire SPE.

In an exemplary embodiment of the invention, VMM core 204 and VPU 206operate with a single clock cycle (e.g., 125 MHz).

It is noted that in order to allow operation with an externalenvironment, in some embodiments of the invention, all that is requiredis to provide interfaces that connect to input and output ports 211 and221 and to host interface 212. Otherwise, the external environment doesnot need to accommodate to the operation rate of SPE 102.

In some embodiments of the invention, as described below, VMM core 204 is implemented by an electro-optic core. In an alternative embodiment of the invention, an analog electrical core is used, for example, as described in “Programmable Analog Vector-Matrix Multipliers”, by F. Kub, K. Moon, I. Mack, F. Long, in IEEE Journal of Solid-State Circuits, vol. 25 (1), pp. 207-214, 1990, or in “Charge-Mode Parallel Architecture for Matrix-Vector Multiplication,” by R. Genov, G. Cauwenberghs, IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, October 2001, the disclosures of which documents are incorporated herein by reference. One potential advantage of light, however, is that it can be fanned out more efficiently, due to its low attenuation. The use of an optical VMM core generally achieves a higher processing speed than electrical VMM cores, as optical units do not have capacitance. The fact that optical processing elements do not have capacitance also potentially reduces cross-talk effects.

Electro-optical Core Overview

FIG. 2A is a schematic block diagram of an electro-optical VMM sub-system 199, in accordance with an exemplary embodiment of the invention. Reference is also made to FIG. 2B, which is a schematic illustration of an optical portion 300 of electro-optical VMM core 204, in accordance with an exemplary embodiment of the present invention. Electro-optical VMM core 204 optionally includes a semi-transparent matrix 308, formed of window elements 311 (FIG. 2B) having transparency values (e.g., between 0 and 1) corresponding to the values of a mathematical matrix by which input vectors are multiplied. Vectors to be multiplied by matrix 308 are received from pre-processor 302 and converted into analog voltage signals by a digital to analog converter (D/A) 305. Optionally, D/A 305 includes a driver which amplifies the analog signals to suitable levels. The analog voltage signals are optionally converted to light signals by a light source array, such as a VCSEL (vertical-cavity surface-emitting laser) unit 304.

Optionally, the light from each light source in VCSEL array 304 is spread out by fan-out optics 306 so as to pass through an entire column of elements 311 of matrix 308, as required in vector-matrix multiplication. The light passing through matrix 308 is optionally converged by fan-in optics 310, such that the light from each matrix row is directed to a respective detector in a light detector array 312. The detectors of array 312 optionally convert the light into analog voltage signals, which are amplified by amplifiers 313 and converted to digital signals by an analog to digital (A/D) converter 315. The digital signals are then optionally passed to post-processor 314 for post-processing.

Matrix

In some embodiments of the invention, the attenuation values of elements 311 of matrix 308 are programmable. Optionally, matrix 308 is an SLM (spatial light modulator) with elements 311 that can be amplitude and/or phase controlled. Optionally, a GaAs SLM is used, in which the attenuation value of each element is controlled by a respective electrical voltage. In an exemplary embodiment of the invention, matrix 308 comprises an MQW (multi-quantum well) light modulator, which allows fast value changes (i.e., a settling time of about a few nanoseconds). In a particular exemplary embodiment of the invention, a matrix change is achieved in about 1-4 microseconds. If about 30,000 matrix changes are performed per second, about 3-12% of the time is required for matrix changing.
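The percentages follow from multiplying the number of matrix changes per second by the time per change. A quick check in Python, assuming the 30,000 changes are spread over one second (the function name is illustrative only):

```python
def matrix_change_overhead(changes_per_second: float, change_time_s: float) -> float:
    """Fraction of each second spent reloading matrix 308."""
    return changes_per_second * change_time_s


for change_time_us in (1.0, 4.0):
    fraction = matrix_change_overhead(30_000, change_time_us * 1e-6)
    print(f"{change_time_us:.0f} us per change -> {fraction:.0%} of the time")
# 1 us per change -> 3% of the time
# 4 us per change -> 12% of the time
```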

Optionally, DSP 214 (FIG. 1) generates the mathematical values of a matrix to be multiplied. The mathematical matrix values are optionally converted into attenuation values of elements 311 by matrix pre-processor (MPP) 225, as described below. The values from MPP 225 are optionally provided to a matrix driver 227, which sets the actual attenuation values of elements 311. In some embodiments of the invention, matrix driver 227 is coupled to matrix 308 using flip-chip bonding. Alternatively or additionally, matrix driver 227 is coupled to elements 311 of matrix 308 using chip-to-chip bonding methods known in the art. Optionally, matrix driver 227 includes a flash digital to analog converter (DAC) for each element. The flash DAC optionally allows updating matrix values at a relatively fast rate (e.g., in less than 1 microsecond). In some embodiments of the invention, the flash DAC comprises a ramp DAC as described in U.S. application Ser. No. 10/234,632, filed on Sep. 3, 2002, and titled “DIGITAL TO ANALOG CONVERTER ARRAY”, the disclosure of which is incorporated herein by reference. Alternatively or additionally, driver 227 includes one or more calibration settings, by which predetermined matrices are loaded, for example, all elements at minimum value, all at maximum value, checkerboards and/or all at an average value or other uniform value.

In some embodiments of the invention, as described above, a matrix memory 220 stores matrix values for fast replacement of the attenuation values of elements 311. In one example, each matrix element and/or column has an associated memory in which replacement values are stored. Thus, the matrix can be updated in a single clock cycle or in another short period. Alternatively or additionally, the matrix values can be shifted, for example by one or more rows down or up, according to the equation M[j,k]<=M[j+N,k], where N is a positive or negative integer, optionally a small integer (e.g., 1 or 2). Alternatively or additionally, a shift may involve the movement of each matrix value one or more columns to the right or left. The shift may be cyclic or may involve insertion of new values in the places emptied by the shift (non-cyclic). Such matrix shifts may be used in running filters, for example in signal identification applications.
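A minimal Python sketch of a row shift of the form M'[j,k]=M[j+n,k], with a shift amount `n`, covering both the cyclic and the non-cyclic variants. The function name and the `fill` parameter (standing in for the newly inserted values) are hypothetical; a column shift is the same operation applied within each row.

```python
from typing import List, Optional, Sequence

Matrix = List[List[float]]


def shift_rows(m: Sequence[Sequence[float]], n: int, cyclic: bool = True,
               fill: Optional[Sequence[float]] = None) -> Matrix:
    """Return M' with M'[j][k] = M[j + n][k] (n may be negative).

    cyclic=True wraps row indices around; otherwise rows that would come from
    outside the matrix are replaced by `fill` (hypothetical new data, zeros by default).
    """
    rows = len(m)
    cols = len(m[0]) if rows else 0
    new_row = list(fill) if fill is not None else [0.0] * cols
    out: Matrix = []
    for j in range(rows):
        src = j + n
        if cyclic:
            out.append(list(m[src % rows]))
        elif 0 <= src < rows:
            out.append(list(m[src]))
        else:
            out.append(list(new_row))
    return out


m = [[1, 2], [3, 4], [5, 6]]
print(shift_rows(m, 1))                              # [[3, 4], [5, 6], [1, 2]]
print(shift_rows(m, -1, cyclic=False, fill=[9, 9]))  # [[9, 9], [1, 2], [3, 4]]
```

In this sketch every emptied row receives the same `fill` row; a running filter would instead supply a fresh row of samples at each step.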

In some embodiments of the invention, a mathematical matrix larger than can be accommodated by VMM core 204 is stored in matrix memory 229. At a first stage, a first portion of the matrix is loaded into VMM core 204 for multiplication. Thereafter, the values are shifted in order to load the remaining parts of the large matrix into VMM core 204.
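The loading of a large matrix in portions can be modeled in software by splitting the matrix into core-sized blocks and accumulating partial results. In this sketch the core dimensions `core_in` and `core_out` are assumed parameters, `core_vmm` is a hypothetical stand-in for one pass through VMM core 204, and the index convention Y(k)=Sum[A(j,k)*X(j)] used later in this section is followed.

```python
from typing import List, Sequence


def core_vmm(block: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Stand-in for one pass through the core: y[k] = Sum_j block[j][k] * x[j]."""
    return [sum(block[j][k] * x[j] for j in range(len(x))) for k in range(len(block[0]))]


def tiled_vmm(a: Sequence[Sequence[float]], x: Sequence[float],
              core_in: int, core_out: int) -> List[float]:
    """Multiply a matrix larger than the core by loading it one block at a time."""
    n_in, n_out = len(a), len(a[0])
    y = [0.0] * n_out
    for j0 in range(0, n_in, core_in):          # block of input elements
        for k0 in range(0, n_out, core_out):    # block of output elements
            block = [list(row[k0:k0 + core_out]) for row in a[j0:j0 + core_in]]
            partial = core_vmm(block, x[j0:j0 + core_in])
            for dk, value in enumerate(partial):
                y[k0 + dk] += value  # accumulate partial sums of each output
    return y


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
x = [1, 2, 3, 4]
print(tiled_vmm(a, x, core_in=2, core_out=2))  # [34.0, 36.0, 46.0]
```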

Optionally, matrix memory 220 stores mathematical values before they are pre-processed by MPP 225. Thus, the stored values remain usable even when the pre-processing rules change. Alternatively, matrix memory 220 stores pre-processed matrix values, so as to allow faster loading of the stored values into matrix 308. Further alternatively, for at least some mathematical matrices, matrix memory 220 stores both pre-processed and non-pre-processed values, which can be used according to the circumstances.

In some embodiments of the invention, the matrix element values are all replaced at once. Alternatively or additionally, matrix driver 227 may replace a single row and/or column of the matrix and/or a single element of the matrix. Further alternatively or additionally, matrix driver 227 may replace arbitrary and/or predetermined rectangular portions of the matrix.

Alternatively to the values of elements 311 of matrix 308 being changeable, elements 311 have fixed attenuation values, which are determined during production and/or during factory calibration. For example, the values may correspond to a specific code for which the multiplication is performed.

In some embodiments of the invention, the matrix elements and the elements of the multiplied vectors are represented by 8 bits. Alternatively, other element sizes may be used. In a specific embodiment, a very small element size is used, for example having only the values {−1, 0, 1}. Such small elements are easily stored and manipulated and are less prone to error. In some embodiments of the invention, VMM sub-system 199 may operate in a full element size mode (e.g., 8 bits) or in a reduced element size mode (e.g., 1 or 2 bits), according to the specific application being processed. The reduced element size mode may be used for data having a small value range, or for bit-planes of full-size elements, as described below.

Light Source and Detectors

Detector array 312 may include substantially any detector type known in the art, such as a monolithic silicon photodiode array or a CMOS array. Alternatively, other photo-detectors may be used, for example detectors comprising GaAs or Ge.

Alternatively or additionally to using VCSEL unit 304 as a light source, other light sources may be used, for example, a pulsed laser or an LED source. In some embodiments of the invention, a continuous wave (CW) laser source is used with light modulators such as liquid crystal displays (LCDs), acousto-optic modulators and/or MQW modulators. Further alternatively or additionally, wave propagation devices, such as the Litton “MO-SLM” (Magneto-Optic SLM) device, are used.

In some embodiments of the invention, instead of using a single light source for each vector value, a plurality of light sources are used for one or more of the vector values. The use of a plurality of light sources optionally provides more efficient and/or uniform light production and/or a better signal to noise ratio. The fan-out optics are optionally adjusted according to the number of light sources used. In an exemplary embodiment of the invention, each vector value is generated by an entire column of light sources, such that each matrix element 311 receives light from a respective light source. In this exemplary embodiment, there is much more flexibility in matching mathematical values to light sources and/or matrix elements, allowing better avoidance of defective elements, for example as described below. This embodiment is most suitable when relatively cheap light sources are used. In some cases, when the data vector is smaller than the maximum size that VMM core 204 can handle, data elements may be duplicated, in adjacent or non-adjacent locations, for example all values or only values with a low intensity.

Alternatively to using a single detector for each row, in some embodiments of the invention a plurality of detectors are used for each row, for example in order to increase the accuracy of the detectors (by reducing the total amount of light impinging on a single detector). Optionally, any of the arrangements described above for the light sources may be used for the detectors. For example, VMM core 204 may include a detector for each matrix element 311. The detected values from the plurality of detectors of each row are optionally added as analog currents, or are converted to digital values and added in digital form.

In some embodiments of the invention, one or more operation parameters of VCSELs 304 and/or detectors 312 are software controlled and/or configurable. For example, a base beam power level and/or a power level range of VCSELs 304 may be changed. In some embodiments of the invention, the power level of each VCSEL may be controlled separately, for example to compensate for local production defects in specific VCSELs 304 and/or specific rows or columns of matrix 308. Alternatively or additionally, the power levels of some or all of VCSELs 304 are controlled together, for example, according to a desired compromise between power consumption and accuracy. The power level of VCSELs 304 is optionally set as a trade-off between accuracy, which requires higher power levels, and reducing heat effects, which requires lower power levels.

Adjustable parameters of detectors 312 may include, for example, their collection time, the amplification gain of amplifiers 313 and/or an analog bias for dark current subtraction.

In some embodiments of the invention, VMM core 204 includes one or more in-core controllers and/or drivers which control the parameters of the VMM core. Alternatively or additionally, the parameters of VMM core 204 are controlled by controller 210 (FIG. 1).

Error Reduction

A potential disadvantage of an optical implementation of VMM core 204 is that, in some implementations, a higher error rate can be expected than with electrical digital devices. In an exemplary embodiment of the invention, SPE 102 is used for applications where the data is originally noisy and/or where the applied algorithms fail softly. One example is communication systems in which error correction and/or detection methods are used, as exemplified below for a cellular communication system.

Alternatively or additionally, error correction methods and/or error reduction methods are implemented by SPE 102. The error correction methods may include, for example, adding error correction bits to the processed data. Optionally, the error reduction methods include performing the same processing operation a plurality of times and averaging the results. The processing speed advantage due to the use of VMM core 204 is much larger than the additional processing power required for the error correction and/or compensation methods.

In general, a VMM architecture as described herein may be useful where extensive calculations are required. In some embodiments of the invention, existing algorithms are redesigned for use with a VMM architecture, for example so that they fail softly on errors in the data, in the implementation of the algorithm and/or in the calculations.

In some embodiments of the invention, SPE 102 is used in applications which require performing one or more transformations, such as correlations, convolutions, permutations, filters, and Fourier and related transforms (e.g., DFT, IDFT, DCT (discrete cosine transform), IDCT, DST (discrete sine transform), IDST). Such applications may include, for example, rake receivers, multi-user detection (for example, as described in a PCT application filed on even date with the instant application and titled “MULTI-USER DETECTION”, the disclosure of which is incorporated herein by reference), third generation base stations as described hereinbelow and/or smart antennas. Further applications optionally include DSL modems (e.g., ADSL, VDSL), image processing, spectrum analysis, echo cancellation, software defined radio, weather prediction, signal processing and/or wireless applications (e.g., MMDS, LMDS). SPE 102 may operate as a general purpose processor or may be used for a specific dedicated application.

Pre- and Post-processing

In some embodiments of the invention, pre-processor 302 and/or post-processor 314 are adapted to perform one or more tasks directed to reducing the effect of inaccuracies of VMM core 204. Optionally, pre-processor 302 changes input vectors in a manner which causes the errors to have a lesser effect on the output vector and/or on further computations. Post-processor 314 optionally reverses the changes applied by pre-processor 302, so that the changes do not affect the result. In some embodiments of the invention, the attenuation values of matrix 308 are changed and/or rearranged, and the changes made by pre-processor 302 conform the input vectors to the changes in matrix 308. Alternatively or additionally, the changes made by pre-processor 302 do not require any changes in matrix 308. Some of the pre-processing tasks are optionally performed according to calibration results. Exemplary calibration procedures in accordance with embodiments of the present invention are described hereinbelow.

In an exemplary embodiment of the invention, pre-processor 302 is adapted to perform one or more of the following pre-processing tasks:

(a) signed to unsigned conversion of input values (e.g., v(i)<=v(i)+128 for 8-bit element vectors). This conversion is optionally used when SPE 102 is used for signed data, for example in the range [−128,127], in cases in which VMM core 204 operates only on positive values.

(b) applying a non-linearity correction function to the input values. Optionally, the non-linearity correction function is implemented by a look-up table (LUT). In some embodiments of the invention, the function provides an output with a larger number of bits than the input, for example 9 bits for an 8-bit input (e.g., for results of addition). The non-linearity correction function is optionally used to correct for non-linearity in the current output of driver 227, the light output of VCSELs 304, the current of the VCSEL drivers and/or non-linearity of other elements of VMM core 204. Alternatively or additionally, a LUT is used to implement other pre-processing tasks, such as sign inversion, scrambling, gain change and/or offset correction (changing the range of the vector values to fit an operation range of the VCSELs).

(c) pre-correction of the values according to errors found during calibration. In some embodiments of the invention, the pre-correction is applied using the function v(i)<=a*v(i)+b, where a and b are correction values having different values for each i, or having the same values for all i. In some embodiments of the invention, a and b have predetermined values or are determined periodically. Alternatively or additionally, a and b vary with time according to a known ripple pattern in the attenuation of the matrix elements, for example when the matrix comprises an LCD.

In some embodiments of the invention, the light output from each laser is given by L(j)=a+b*v(j), where v(j) is the value of element j in the input vector. The light from the laser depends on the current I(j) of the laser driver, approximately as L(j)=Eta*(I(j)−I₀), where Eta is the slope efficiency of the laser and I₀ is the threshold current. Both Eta and I₀ may depend on temperature and may vary from one laser to the next. Optionally, a linear transformation between the vector value v(j) and the digital value provided to the current driver is used to compensate for variations in Eta and I₀. The transformation is optionally calculated each time a value is to be used, or during calibration.

The variable “a” optionally has a small positive value, such that light is always produced by VCSELs 304, even when the represented value is zero (e.g., after signed to unsigned conversion). This possibly prevents slowdown of VMM core 204 due to the need to restart VCSELs 304 after they are shut off. Alternatively or additionally, when a VCSEL 304 is not used for a relatively long period, for example when a whole set of vectors has zero padding in the corresponding element, the VCSEL is shut off.
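The linear transformation that compensates for per-laser variations in Eta and I₀ can be sketched as follows: solving L(j)=Eta*(I(j)−I₀) for the current that yields the desired L(j)=a+b*v(j) gives I(j)=I₀+(a+b*v(j))/Eta, which is itself linear in v(j). The laser parameters below are made-up values for illustration, the function names are hypothetical, and the mapping from current to the digital value written to the driver is not modeled.

```python
from typing import Tuple


def linear_transform(a: float, b: float, eta: float, i0: float) -> Tuple[float, float]:
    """Return (alpha, beta) such that I = alpha + beta*v gives L = a + b*v."""
    return i0 + a / eta, b / eta


def light_output(current: float, eta: float, i0: float) -> float:
    """Linear laser model above threshold: L = Eta * (I - I0)."""
    return eta * (current - i0)


A_OFFSET, B_GAIN = 0.1, 0.01           # hypothetical target L = 0.1 + 0.01*v
lasers = [(0.50, 1.0), (0.45, 1.2)]    # hypothetical (Eta, I0) of two lasers

for eta, i0 in lasers:
    alpha, beta = linear_transform(A_OFFSET, B_GAIN, eta, i0)
    # Same vector value, different lasers: the light output matches the target
    print(round(light_output(alpha + beta * 100, eta, i0), 6))  # 1.1
```

Because the offset a is positive, a zero input still drives each laser slightly above threshold, which matches the behavior described for the variable “a”.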

(d) rearranging the values of the input vector according to the arrangement of the matrix values in the matrix. In some embodiments of the invention, the values of the matrix are shuffled in order to reduce cross-talk and/or to use redundant matrix elements instead of defective elements. Optionally, pre-processor 302 reorders the vector elements according to the matrix arrangement.

(e) scrambling (inverting the values of some of the elements of the input vector). In an exemplary embodiment of the invention, the even elements (or odd elements) of the input vector are inverted (i.e., their sign is changed) in order to reduce the effect of correlated noise and/or to randomize the effect of errors. Such noise may be due to the different power levels used for negative and positive values, which may cause changes in temperature and/or other system parameters.

Optionally, the corresponding matrix values (e.g., the values of even columns) are inverted accordingly, such that the result does not change. Alternatively or additionally, the post-processing compensates for the change in the value. In some embodiments of the invention, the inversion of 8-bit positive values (i.e., in the range [0,255]) is performed by the function v(i)<=255−v(i).
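Inverting input elements together with the matrix entries they multiply leaves the result unchanged, because each product A(j,k)*v(j) keeps its sign when both factors are negated. A minimal Python check, using signed values and the index convention Y(k)=Sum[A(j,k)*v(j)], so that the entries multiplying v(j) are A(j,·); the function names are hypothetical. The unsigned variant v(i)<=255−v(i) additionally requires an offset correction in post-processing, which this sketch does not model.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def vmm(a: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
    """Y(k) = Sum_j A(j,k) * v(j)."""
    return [sum(a[j][k] * v[j] for j in range(len(v))) for k in range(len(a[0]))]


def scramble_even(a: Sequence[Sequence[float]],
                  v: Sequence[float]) -> Tuple[Matrix, List[float]]:
    """Negate even-indexed input elements and the matrix entries that multiply them."""
    v_s = [-x if j % 2 == 0 else x for j, x in enumerate(v)]
    a_s = [[-e for e in row] if j % 2 == 0 else list(row) for j, row in enumerate(a)]
    return a_s, v_s


a = [[1, 2], [3, 4], [5, 6]]
v = [7, 8, 9]
print(vmm(a, v), vmm(*scramble_even(a, v)))  # [76, 100] [76, 100]
```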

(f) repeated multiplication for extended accuracy. In some embodiments of the invention, pre-processor 302 is adapted to provide the same input vector (optionally pre-processed in different ways, for example by scrambling) for several consecutive cycles. Optionally, some or all of the input vectors are multiplied by the matrix a plurality of times. Post-processor 314 optionally averages the results of the plurality of multiplications and provides the average as output. Thus, sporadic errors in the multiplication result may be reduced. In some embodiments of the invention, the number of times an input vector is multiplied by the matrix depends on the required accuracy versus the desired processing speed. In some embodiments of the invention, post-processor 314 performs the averaging digitally. Alternatively or additionally, the averaging, or the summation it requires, is performed in an analog circuit, in electrical and/or optical form.

In some embodiments of the invention, extended accuracy is achieved by repeating the multiplication twice, once with the original input vector and once with a negated form of the input vector. The vector results of the two multiplications are optionally averaged to form a result vector Y(k), using the function: Y(k)=(A*v−A*(−v))/2, where A*v denotes the multiplication of the input vector v by the matrix A. This enhanced mode of operation possibly minimizes some inaccuracies related to drift of the components and some of the random noise associated with the VMM operation. In some embodiments of the invention, the vector negation is performed by pre-processor 302, and post-processor 314 performs the subtraction and division by 2. Alternatively, some or all of these operations are performed by VPU 206.
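The following sketch illustrates why the negated-input mode helps: if the core output contains an offset that does not depend on the sign of the input (for example, slow drift), subtracting the two results cancels it exactly, while the signal terms add. The `measure` function and its fixed per-output offset are an assumed, simplified error model, not a description of VMM core 204; random noise is only reduced, not cancelled, and is omitted here.

```python
from typing import List, Sequence


def vmm(a: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
    """Ideal product Y(k) = Sum_j A(j,k) * v(j)."""
    return [sum(a[j][k] * v[j] for j in range(len(v))) for k in range(len(a[0]))]


def measure(a, v, offset):
    """Hypothetical core model: ideal product plus a sign-independent offset."""
    return [y + o for y, o in zip(vmm(a, v), offset)]


def negation_averaged(a, v, offset):
    """Y(k) = (A*v - A*(-v)) / 2, computed from two offset-corrupted measurements."""
    plus = measure(a, v, offset)
    minus = measure(a, [-x for x in v], offset)
    return [(p - m) / 2 for p, m in zip(plus, minus)]


a = [[1, 2], [3, 4]]
v = [5, 6]
drift = [0.5, -1.25]
print(measure(a, v, drift))            # [23.5, 32.75]
print(negation_averaged(a, v, drift))  # [23.0, 34.0]
```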

(g) history cancellation. Optionally, pre-processor 302 corrects for history effects in VMM core 204, e.g., in VCSELs 304 and/or detectors 312, by subtracting a fraction of the previous input vector from the input vector. In an exemplary embodiment of the invention, the function V(i)<=(1−K)*V(i)−K*V′(i) is used, where V′(i) is the previous input vector and K is a constant. Optionally, K is a small constant and is the same for all i. Alternatively, K is different for different i, for example in order to compensate for different history effects in different VCSEL light sources. Alternatively, the history correction may take into account more than one previous input vector.

A particular example of history cancellation is temperature compensation. In some embodiments of the invention, history cancellation is designed to correct for the different heating of each VCSEL by the current flowing through it, which depends on the input value Vin(t), where t is the time. Vin(t) is a number between −127 and +128. Alternatively or additionally, temperature compensation is used to correct for differential cooling in a device and/or to compensate for hot spots or other sources of temperature gradients, temperatures outside the design range and/or varying temperatures. For correct multiplication, it is generally desirable to maintain a linear relation between Vin(t) and the light output L(t): L(t)=L0+L1*(Vin(t)+127), where L0 and L1 are system parameters which are the same for all the channels.

The light output of a VCSEL can be modeled as depending on the driver current I(t) as: L(t)=E(T)*(I(t)−Io(T)), where both the slope efficiency E(T) and the threshold current Io(T) depend on the temperature T.

Typical dependencies are: E(T)=E0−E1*T and Io(T)=Io0+Io1*T+Io2*T^2, where E0, E1, Io0, Io1 and Io2 depend on the construction of the VCSEL and on the properties of the specific laser, and may vary among the lasers in the same array.

In an exemplary embodiment of the invention, the VCSEL driver controls the current I(t) by placing a number V(t) at the DAC, so that the current is I(t)=C0+C1*Vin(t).

Stabilizing (or measuring and compensating for) the temperature of the laser casing is, in some cases, insufficient, since the laser is heated by the current passing through it while it is cooled by conduction to the case.

In an exemplary embodiment of the invention, the temperature of each laser just before pulse t is computed from the known history of the heat deposited by the preceding pulses t−1, t−2 (or more).

Using the heat equation, the temperature at time t can be approximated by: T(t)=T1*(T(t−1)−T0)+T2*H(t−1), where:

T0 relates to the case temperature;

T1 relates to the rate of cooling;

T2 relates to the rate of heating by the current; and

H(t−1) is the amount of heat deposited by the last (t−1) pulse: H(t−1)=V(t−1)*I(t−1)−L(t−1), where V(t−1) is the voltage on the laser while current I(t−1) is flowing through it. A suitable approximation is H(t−1)=T3+T4*I(t−1), where T3 and T4 are system constants.

In an exemplary embodiment of the invention, the purpose of the temperature history correction is to ensure that the above linear relationship is maintained.

For this purpose, each VCSEL is associated with a processor that follows the temperature of the VCSEL according to T(t)=a+b*T(t−1)+c*Vin(t−1) and applies a correction to the input value Vin(t):

V′in(t)=A+B*Vin(t)+C*T(t)*Vin(t)+D*T(t)+E*T(t)^2+F*T(t)^2*Vin(t),

where a, b, c, A, B, C, D, E and F are optionally calculated from models describing the laser operation, by solving the above equations, or are inferred from measurements on each laser in the array during calibration of the VMM.

As a practical matter, some of the coefficients (e.g., E and F) may be very small and may be ignored.

In an exemplary embodiment of the invention, the current is corrected to an accuracy of about half the difference between two adjacent light levels. Thus, if Vin(t) is an 8-bit value, the result of this calculation (and the current-controlling DAC) uses 9 bits.

In an exemplary embodiment of the invention, it is noted that the corrections to the various values (e.g., room temperature, history and non-linearity) affect only the last few (about 4) bits of all these calculations. Optionally, these calculations are therefore carried out with very limited accuracy, ignoring all small values. Dedicated hardware (e.g., an ASIC or an FPGA) is optionally used to perform these calculations.

Alternatively or additionally, other temperature compensation methods may be used, for example, using a calibration step to determine a mapping between current value and actual light output (e.g., as detected by a detector). This mapping can be used, for example, to generate a table or to approximate the above or another physical model and/or a correction function (e.g., a low-order polynomial or a piece-wise approximation). Alternatively or additionally, one or more temperature sensors may be used. Alternatively or additionally, a temperature-sensitive circuit (e.g., a resistor) may be provided in conjunction with each VCSEL, to correct for at least some temperature effects.
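The per-VCSEL tracking and correction can be sketched as a small state machine: before each pulse, the temperature estimate is advanced with T(t)=a+b*T(t−1)+c*Vin(t−1), and the input is then corrected with the polynomial V′in(t)=A+B*Vin(t)+C*T(t)*Vin(t)+D*T(t)+E*T(t)^2+F*T(t)^2*Vin(t) and quantized to 9 bits. The sketch assumes the input has already been shifted to the unsigned range 0-255 and that the 9-bit code is simply twice the corrected value, rounded and clamped; the class name is hypothetical and all coefficient values are placeholders that would in practice come from a laser model or from calibration.

```python
from dataclasses import dataclass


@dataclass
class VcselCompensator:
    """Hypothetical per-VCSEL temperature-history corrector (coefficients are placeholders)."""

    # Thermal model: T(t) = a + b*T(t-1) + c*Vin(t-1)
    a: float
    b: float
    c: float
    # Correction: V'in = A + B*Vin + C*T*Vin + D*T + E*T^2 + F*T^2*Vin
    A: float
    B: float
    C: float
    D: float
    E: float = 0.0  # often negligible, per the text
    F: float = 0.0  # often negligible, per the text
    temperature: float = 0.0
    prev_vin: float = 0.0

    def correct(self, vin: float) -> int:
        """Return a 9-bit DAC code (0..511) for an unsigned 8-bit input value."""
        # Advance the temperature estimate using the heat of the previous pulse
        t = self.a + self.b * self.temperature + self.c * self.prev_vin
        self.temperature = t
        self.prev_vin = vin
        v = (self.A + self.B * vin + self.C * t * vin + self.D * t
             + self.E * t * t + self.F * t * t * vin)
        # Half-step resolution: one extra bit relative to the 8-bit input
        return max(0, min(511, round(2 * v)))


comp = VcselCompensator(a=1.0, b=0.5, c=0.01, A=0.0, B=1.0, C=0.0, D=-2.0)
print(comp.correct(100), comp.correct(100))  # 196 190
```

With these placeholder values, the same input receives a smaller drive code on the second pulse because the estimated temperature has risen, which is the qualitative behavior the correction is meant to provide.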

(h) boosting input values directed to lasers far from the center of VCSEL array 304, in order to compensate for the spatial inefficiency of the fan-in optics. Alternatively or additionally, boosting and/or reducing the gain may be performed for other reasons.

(i) limiting the dynamic range of the input values of some or all of the elements of the input vector. In some embodiments of the invention, some of the elements of matrix 308 may have a limited dynamic range, for example due to production inaccuracies. Similarly, elements of VCSELs 304, detectors 312 and/or other elements of VMM core 204 may have limited dynamic ranges or may handle some values with low accuracy. Optionally, in these embodiments, pre-processor 302 limits the input values to a range in which the elements of, for example, matrix 308 are relatively accurate. In some embodiments of the invention, for simplicity, all the elements of the input vector are limited to the same range. Alternatively, each element of the input vector may be set to a different dynamic range according to the respective elements in matrix 308.

(j) extending the dynamic range. In some embodiments of the invention, the input values are, in some cases, known to be limited to a small range, e.g., 0-15. Optionally, in these cases, pre-processor 302 extends the range of the values over the entire dynamic range of VMM core 204, in order to achieve a higher accuracy. Optionally, if one or more of the input bits is known to be constantly zero (or constantly ‘1’), the extension of the range is achieved by shifting the data and/or selecting the significant bits (i.e., the bits that are not constant) and putting them in the most significant positions.
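For values known to occupy only the low bits, the range extension amounts to a left shift before the multiplication and the matching scaling after it, since the matrix product is linear in the input. A minimal Python sketch, assuming unsigned integer inputs, an 8-bit core input width, and that the caller knows how many low bits are in use (`used_bits` and both function names are hypothetical):

```python
from typing import List, Sequence, Tuple


def extend_range(values: Sequence[int], used_bits: int,
                 total_bits: int = 8) -> Tuple[List[int], int]:
    """Move values that fit in `used_bits` bits into the most significant positions."""
    shift = total_bits - used_bits
    if shift < 0:
        raise ValueError("used_bits exceeds the core input width")
    if any(v < 0 or v >= (1 << used_bits) for v in values):
        raise ValueError("value outside the declared range")
    return [v << shift for v in values], shift


def restore_scale(results: Sequence[float], shift: int) -> List[float]:
    """Undo the input scaling on the output vector (post-processing step)."""
    return [r / (1 << shift) for r in results]


scaled, shift = extend_range([0, 15, 7], used_bits=4)
print(scaled, shift)  # [0, 240, 112] 4
```

The accuracy gain presumably comes from analog errors of roughly fixed absolute size becoming smaller, relative to the signal, once the scaled-up result is divided back down.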

In some embodiments of the invention, pre-processor 302 partitions each of the input vectors into two or more portion vectors which together represent the input vector. For example, pre-processor 302 may perform a bit-plane decomposition, in which the elements of the input vector are partitioned into single bits or groups of bits. Alternatively or additionally, the input vectors may be split into two or more vectors whose sum or element-by-element product is equal to the input vector. The partitioning may form vectors of the same or different sizes. Each portion vector is optionally multiplied by matrix 308 separately, and the result vectors are optionally combined by post-processor 314. Alternatively or additionally, the result vectors are combined while the results are represented by light beams (e.g., between fan-in optics 310 and detectors 312) or by analog electrical currents (e.g., between detectors 312 and A/D converter 315, or within detectors 312, which collect the light of a plurality of multiplication cycles).

Partitioning the vectors into portions (e.g., bit planes or pairs of bits) possibly allows higher-accuracy handling of each portion, and altogether more accurate processing, possibly at the expense of slower operation. In some embodiments of the invention, the extent of partitioning of the input vector may be adjusted at run time and/or configured at initialization (or after manufacture), according to a desired compromise between accuracy and run-time speed.

In some embodiments of the invention, each bit plane is multiplied as a number of its own, without taking into account the position of the bit-plane in the original value. For example, in partitioning the number ‘11101001’, multiplication is performed for the numbers ‘1110’ and ‘1001’. Post-processor 314 optionally shifts the result of the first multiplication (by four positions in this example) before adding the results together. Alternatively or additionally, each bit plane is multiplied as a number which retains its position in the original value. For example, in partitioning the number ‘11101001’, multiplication is performed for the numbers ‘11100000’ and ‘1001’. In this alternative, the results may simply be added without correction or shifting, either digitally (e.g., by post-processor 314) or in analog form (either electrical or optical). In an exemplary embodiment of the invention, the partitioning of a vector into portions is non-uniform. For example, more lower-order bits are grouped together than higher-order bits.

Further alternatively or additionally, one or more of the bit-planes of the higher-order portion of the original value retains its position, but is shifted to the right by one or more positions, to enhance accuracy. Optionally, instead of shifting the result by post-processor 314, the multiplication is performed a predetermined number of times, as required to replace the shift, and the results of all the multiplications are added. Optionally, more repeated multiplications are performed for higher-order bits than for lower-order bits, since the accuracy of the most significant part of the number is typically most important in preventing errors. In an exemplary embodiment of the invention involving 8-bit input values, the input vectors are split into the 2 most significant bits and the 6 least significant bits. The multiplication of the 2 most significant bits is optionally performed twice or four times, while the multiplication of the 6 least significant bits is performed once.
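The 2+6 bit split described above can be checked in Python: the input vector is split into its 2 most significant bits and 6 least significant bits, each portion is multiplied separately, and the results are recombined with a shift of 6 positions. The value 233 used below is the binary number ‘11101001’ from the earlier example, and passing `lsb_bits=4` reproduces the ‘1110’/‘1001’ split. The helper names are hypothetical, and repeated multiplications of the high-order portion are not modeled.

```python
from typing import List, Sequence, Tuple


def vmm(a: Sequence[Sequence[int]], x: Sequence[int]) -> List[int]:
    """Y(k) = Sum_j A(j,k) * X(j)."""
    return [sum(a[j][k] * x[j] for j in range(len(x))) for k in range(len(a[0]))]


def split_msb_lsb(x: Sequence[int], lsb_bits: int = 6) -> Tuple[List[int], List[int]]:
    """Split each element into its high-order and low-order bit groups."""
    mask = (1 << lsb_bits) - 1
    return [v >> lsb_bits for v in x], [v & mask for v in x]


def bitplane_vmm(a: Sequence[Sequence[int]], x: Sequence[int], lsb_bits: int = 6) -> List[int]:
    high, low = split_msb_lsb(x, lsb_bits)
    y_high = vmm(a, high)  # inputs in 0..3 for 8-bit values and lsb_bits=6
    y_low = vmm(a, low)    # inputs in 0..63
    # Shift the high-order result back to its place value before adding
    return [(h << lsb_bits) + l for h, l in zip(y_high, y_low)]


a = [[1, 2], [3, 4]]
x = [233, 77]  # 233 == 0b11101001
print(bitplane_vmm(a, x), vmm(a, x))  # [464, 774] [464, 774]
```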

Alternatively or additionally to performing the different bit-plane multiplications at different times, the different multiplications may be performed in parallel on different portions of the matrix (when the matrix is larger than the input vectors) and/or on a plurality of parallel VMM cores. The different portions and/or matrices may hold the same values, or may hold different values as required for the specific vector element portions they are to handle.

In an exemplary embodiment of the invention, in addition to cooperating with pre-processor 302, post-processor 314 is adapted to perform one or more of the following post-processing tasks:

(k) post-correction of the values according to errors found during calibration. In some embodiments of the invention, the post-correction is applied using the function v(i)<=a*v(i)+b, where a and b are correction values having different values for each i, or having the same values for all i.

In an exemplary embodiment of the invention, the gain a and the offset b are determined by noting that the operation to be performed by the VMM is Y=A X, where A is a matrix, and X and Y are vectors. This may be represented as: Y(k)=Sum[A(j,k)*X(j)], for j={1, N}.

Elements of X are represented by the light of the lasers, which, in some embodiments of the invention, cannot directly represent negative numbers. In these embodiments, the light is L(j)=a+b*X(j).

Similarly, the reflectance R(j,k) of the matrix has a finite contrast ratio and is non-negative, ranging from Rmin to Rmax, where 0<Rmin<Rmax<1.

Thus, the reflectance is R(j,k)=c+d*A(j,k).

The light collected on a detector D(k) is the sum of all the light directed to it by the optics. So, with perfect optics:

$D(k) = \sum_{j=1}^{N} R(j,k)\,L(j) = \sum_{j=1}^{N}\left(c + d\,A(j,k)\right)\left(a + b\,X(j)\right) = N c a + c b \sum_{j=1}^{N} X(j) + d a \sum_{j=1}^{N} A(j,k) + d b \sum_{j=1}^{N} A(j,k)\,X(j)$

Examining these results, we see that the first term, C1=N*c*a, is a constant that depends on neither the matrix nor the input vector.

The third term, d*a*Sum[A(j,k)] for j={1, N}, is different for each column “k” and depends on the values of the matrix elements. Once a matrix is loaded, this factor does not generally change (except, for example, for temperature effects). Since matrix replacement is relatively rare, and, in some applications, a set of a small number (e.g., 2 or 3) of matrices is cycled, it is possible to digitally calculate C2(k)=d*a*Sum[A(j,k)] for j={1, N} and use it for correction.

The second term, c*b*SumX, where SumX=Sum[X(j)] for j={1, N}, depends on the vector X, which changes with every VMM operation.

SumX is optionally calculated by digital summation.

Alternatively, at least one, and preferably several, detectors D(K) are dedicated to this purpose. In an exemplary embodiment of the invention, the matrix elements in column K are kept at a constant value of “1”: A(j,K)=1.

Then, since A(j,K)=1 for all j: $D(K) = N c a + c b \sum_{j=1}^{N} X(j) + N d a + d b \sum_{j=1}^{N} X(j)$

We use the two already known corrections, C1 and C2(K)=N*d*a, to calculate the last: $D(K) - C1 - C2(K) = b\,(c + d) \sum_{j=1}^{N} X(j)$, so that $\mathrm{SumX} = \sum_{j=1}^{N} X(j) = \frac{D(K) - C1 - C2(K)}{b\,(c + d)}$.

To extract Y(k), the measurement D(k) is corrected by removing the three known terms and dividing by the gain d*b: $Y(k) = \frac{D(k) - C1 - C2(k) - c\,b\,\mathrm{SumX}}{d\,b}$
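As a minimal numeric check of this correction, the Python sketch below models ideal optics with the constants a, b, c, d assumed known from calibration, dedicates one matrix column to all-ones values, and recovers Y(k)=Sum[A(j,k)*X(j)] for the remaining columns. The function names `detect` and `recover_products` and all numeric values are illustrative assumptions.

```python
def detect(A, X, a, b, c, d):
    """Ideal-optics detector readings D(k) = sum_j (c + d*A(j,k)) * (a + b*X(j))."""
    N, M = len(A), len(A[0])
    return [sum((c + d * A[j][k]) * (a + b * X[j]) for j in range(N)) for k in range(M)]


def recover_products(A, D, a, b, c, d, ref_col):
    """Return Y(k) = sum_j A(j,k)*X(j) for every column except the reference column.

    Column `ref_col` of A is assumed to hold A(j, ref_col) = 1 for all j.
    """
    N, M = len(A), len(A[0])
    C1 = N * c * a
    # C2(k) depends only on the loaded matrix, so it can be computed once per load.
    C2 = [d * a * sum(A[j][k] for j in range(N)) for k in range(M)]
    sum_x = (D[ref_col] - C1 - C2[ref_col]) / (b * (c + d))
    return [(D[k] - C1 - C2[k] - c * b * sum_x) / (d * b)
            for k in range(M) if k != ref_col]


A = [[0.2, 0.5, 1.0],
     [0.7, 0.1, 1.0],
     [0.4, 0.9, 1.0]]          # last column is the all-ones reference column
X = [10, 20, 30]
a, b, c, d = 0.1, 0.01, 0.05, 0.9   # illustrative system constants
D = detect(A, X, a, b, c, d)
Y = recover_products(A, D, a, b, c, d, ref_col=2)
print(Y)  # approximately [28.0, 34.0]
```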

In an alternative method, two dedicated detectors are used, D(K) and D(L), where A(j,K)=1 and A(j,L)=0.5.

Then:

$D(K) = N c a + c b \sum_{j} X(j) + d a \sum_{j} 1 + d b \sum_{j} X(j)$

$D(L) = N c a + c b \sum_{j} X(j) + d a \sum_{j} 0.5 + d b \sum_{j} 0.5\,X(j)$

Then: $D(K) - D(L) = \frac{N d a}{2} + \frac{d b\,\mathrm{SumX}}{2}$, or $\mathrm{SumX} = \frac{2\left(D(K) - D(L)\right) - N d a}{d b}$
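A similar sketch for the two-reference-column variant, assuming the same ideal detector model. Because c cancels in D(K)−D(L), only N, a, b and d are needed. The function names and the values are hypothetical.

```python
def detect_column(col, X, a, b, c, d):
    """Reading of one detector whose matrix column holds the values `col`."""
    return sum((c + d * A_jk) * (a + b * x_j) for A_jk, x_j in zip(col, X))


def sum_x_from_two_refs(D_K, D_L, N, a, b, d):
    """SumX from reference columns holding all 1s (K) and all 0.5s (L)."""
    return (2 * (D_K - D_L) - N * d * a) / (d * b)


X = [1, 2, 3, 4]
a, b, c, d = 0.2, 0.05, 0.03, 0.8   # illustrative system constants
N = len(X)
D_K = detect_column([1.0] * N, X, a, b, c, d)
D_L = detect_column([0.5] * N, X, a, b, c, d)
print(sum_x_from_two_refs(D_K, D_L, N, a, b, d))  # close to sum(X) = 10
```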

The constants a, b, c, d are parameters of the system that generally do not change during operation. Calibration procedures are set up to calibrate them and maintain their values.

These constants may be calibrated by performing measurements with known values of X, e.g., X={255, 255, . . . } and X={127, 127, . . . }, to extract a, b, c, d and C2(k).

It should be noted that the values 1, 0.5, 255 and 127 for A and/or X could be replaced with other values and combinations of values. In some embodiments of the invention, for example when the matrix is unbalanced, with some lines having many zeros, different matrix lines (and/or data elements) may be weighted so that the average detected value is within a relatively small range. One case where this might happen is in CDMA calculation, where a code matrix may include many zeros, and especially more zeros for short codes as compared to long codes. In this case, lines representing long codes may use smaller values, e.g., inversely related to their code length.

(l) applying a non-linearity correction function to the output values. Optionally, the non-linearity correction function is implemented by a look-up table (LUT). In some embodiments of the invention, the function provides an output with a larger number of bits than the input, for example 9 bits for an input of 8 bits. The non-linearity correction function is optionally used to correct for non-linearity in VMM core 204 and/or in other elements of SPE 102.
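One way such a LUT could be built is sketched below. It assumes, for illustration only, that calibration found a power-law response raw = 255·(true/511)^γ with a hypothetical γ=0.8. The LUT then maps each 8-bit raw reading to a 9-bit linearized value. The names `GAMMA`, `build_lut` and `linearize` are illustrative.

```python
GAMMA = 0.8  # hypothetical non-linearity exponent measured during calibration


def build_lut(gamma=GAMMA, in_bits=8, out_bits=9):
    """LUT mapping each raw 8-bit reading to a linearized 9-bit value.

    Assumes raw = in_max * (true / out_max) ** gamma, so the LUT applies
    the inverse power law and rounds to the output resolution.
    """
    in_max = (1 << in_bits) - 1
    out_max = (1 << out_bits) - 1
    return [round(out_max * (m / in_max) ** (1.0 / gamma)) for m in range(in_max + 1)]


def linearize(readings, lut):
    """Apply the correction to a vector of raw detector readings."""
    return [lut[m] for m in readings]


lut = build_lut()
true_value = 300
raw = round(255 * (true_value / 511) ** GAMMA)  # simulated non-linear reading
print(raw, linearize([raw], lut))
```

A table lookup keeps the per-sample cost constant regardless of how complex the measured correction curve is.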

(m) history cancellation. Optionally, post-processor 314 corrects for history effects in detectors 312 by subtracting from the output vector a fraction of one or more previous output vectors. In an exemplary embodiment of the invention, the function V(i)<=(1−K)*V(i)−K*V′(i) is used, where V′(i) is the previous output vector and K is a constant. Optionally, K is a small constant and is the same for all i. Alternatively, K is different for different i, for example, in order to compensate for different history effects in different detectors. Alternatively, the history correction may take into account more than one previous output vector.

(n) matrix column sign inversion. In some embodiments of the invention, for each column k having a high average reflectance (e.g., representing elements having a positive average), the reflectance values of the column are reversed. In an exemplary embodiment of the invention, this inversion is performed in order to reduce the level of shot noise in detectors 312, which noise is generally proportional to the square root of the light intensity impinging on the detector. Post-processor 314 optionally inverts the detected values of columns which were inverted.
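The sketch below illustrates column inversion. It assumes that matrix values are normalized to [0, 1] and that reversing a column means storing 1−A(j,k). The post-processing step then recovers the original product from the digitally computed Sum[X(j)], since Sum[(1−A(j,k))*X(j)] = Sum[X(j)] − Sum[A(j,k)*X(j)]. The threshold of 0.5 and the function names are hypothetical.

```python
def choose_inverted_columns(A, threshold=0.5):
    """Columns whose average (normalized) value exceeds `threshold`."""
    N, M = len(A), len(A[0])
    return {k for k in range(M) if sum(A[j][k] for j in range(N)) / N > threshold}


def load_matrix(A, inverted):
    """Values written to the optical matrix: 1 - A(j,k) for inverted columns."""
    return [[1.0 - v if k in inverted else v for k, v in enumerate(row)] for row in A]


def optical_vmm(R, X):
    """Ideal detector readings Y(k) = sum_j R(j,k) * X(j)."""
    return [sum(R[j][k] * X[j] for j in range(len(X))) for k in range(len(R[0]))]


def undo_inversion(Y_measured, X, inverted):
    """Post-processing: sum((1 - A) * X) = sum(X) - sum(A * X)."""
    total = sum(X)
    return [total - y if k in inverted else y for k, y in enumerate(Y_measured)]


A = [[0.9, 0.1], [0.8, 0.2], [0.95, 0.3]]
X = [1.0, 2.0, 3.0]
inverted = choose_inverted_columns(A)
R = load_matrix(A, inverted)       # column 0 now passes much less light
Y = undo_inversion(optical_vmm(R, X), X, inverted)
print(inverted, Y)
```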

In some embodiments of the invention, when the processed vectors, and hence the mathematical matrix multiplying the vectors, are smaller than matrix 308, the additional elements 311 of matrix 308 are used to reduce error levels. Optionally, some of the values of the mathematical matrix are represented by a plurality of matrix elements 311. The light signal to be multiplied by a value of the mathematical matrix is optionally passed through all the plurality of matrix elements 311 representing the mathematical matrix value. The resulting values are optionally averaged, so as to provide more accurate results. The matrix elements used for the duplicated values may be adjacent to each other, in order to reduce cross-talk effects, or may be separated from each other, in order to avoid local inaccuracies of the matrix.

For example, in some embodiments of the invention, repeated multiplication for extended accuracy is performed for small vectors by duplicating the input vectors and the matrix elements which are to multiply the input vectors. For example, when a VMM sub-system 199 having 256×256 elements in its matrix 308 is used to multiply a vector v of 128 elements by a mathematical matrix A of 128×128 elements, the actual multiplication performed by VMM sub-system 199 is:

$\begin{bmatrix} r_{1} \\ r_{2} \end{bmatrix} = \begin{bmatrix} A & 0 \\ 0 & A \end{bmatrix} \begin{bmatrix} v \\ v \end{bmatrix}$

and the result vector r is calculated as the average of r₁ and r₂. Alternatively, one of the instances of vector v or of matrix A is negated, as discussed above. It is noted that if the matrix is smaller than 128 elements, the remaining positions may be padded with zeros. Alternatively or additionally, additional instances of the vector may be multiplied by additional matrix instances. In some embodiments of the invention, for example, when there is not enough room for an entire additional instance of vector v in core 204, a portion of the vector is repeated for multiplication, for example according to the values of the matrix elements involved in the multiplication.
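A Python sketch of the duplicated multiplication above. It uses an assumed noise model, independent Gaussian noise on each detector, to show why averaging r₁ and r₂ helps: the variance of the averaged result is half that of a single reading. The function name and values are illustrative.

```python
import random


def duplicated_vmm(A, v, noise_sigma=0.0, rng=None):
    """Multiply v by A twice in one pass through a 2n x 2n block-diagonal matrix, then average.

    Each detector reading gets independent Gaussian noise of standard deviation
    `noise_sigma` (an assumed noise model).
    """
    n = len(v)
    rng = rng or random.Random(0)
    big = [[0.0] * (2 * n) for _ in range(2 * n)]
    for j in range(n):
        for k in range(n):
            big[j][k] = A[j][k]            # upper-left copy of A
            big[n + j][n + k] = A[j][k]    # lower-right copy of A
    x = list(v) + list(v)                  # duplicated input vector [v; v]
    r = [sum(big[j][k] * x[j] for j in range(2 * n)) + rng.gauss(0.0, noise_sigma)
         for k in range(2 * n)]
    r1, r2 = r[:n], r[n:]
    return [(p + q) / 2 for p, q in zip(r1, r2)]


A = [[1, 2], [3, 4]]
v = [5, 6]
print(duplicated_vmm(A, v))  # noiseless case equals the direct product
```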

Alternatively or additionally, the input vector may be used to represent values with more than 8 bits, for example a pair of vector elements representing 16 bits. The post-processing, VPU and/or DSP may be used to effect a carry function. Optionally, the two element parts overlap in one or two bits, for example two 8-bit elements representing a 14-bit value.

Alternatively or additionally, when fewer than all the matrix elements 311 are required, the matrix elements used are separated from each other as much as possible, in order to reduce cross-talk effects. Different matrix values may be separated by different amounts.

In some embodiments of the invention, the operation speed of VMM core 204 is adjusted according to the required accuracy of the results. Optionally, when high accuracy results are required, VMM core 204 is operated at a relatively low speed.

The pre-processing and/or post-processing tasks to be performed on a specific input batch are optionally programmed into SPE 102 according to the specific input and the accuracy and/or speed required from its processing. Alternatively or additionally, the pre-processing and/or post-processing tasks are selected according to an error level of matrix 308 and/or VCSELs 304.

Pre-processor 302 and/or post-processor 314 are optionally implemented by a field programmable gate array (FPGA) and/or an application specific integrated circuit (ASIC). Alternatively or additionally, pre-processor 302 and/or post-processor 314 are implemented on a digital signal processor, a general purpose processor and/or any other processor, either the same as DSP 214 or a different stand-alone processor.

Handling Large Input Vectors

In some embodiments of the invention, when input vectors having a number of elements greater than the number of rows in matrix 308 need to be multiplied, a plurality of multiplications are performed between portions of the vector and respective matrix portions. VPU 206 is optionally used to consolidate the resulting vectors.
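A minimal sketch of this partitioning. The long input vector is split into chunks that fit the core, each chunk is multiplied by the matching rows of the matrix, and the partial results are summed, which corresponds to the consolidation attributed above to VPU 206. The parameter `core_rows` and the function name are hypothetical.

```python
def partitioned_vmm(A, x, core_rows=256):
    """Multiply a long vector by A (len(x) rows) on a core limited to `core_rows` inputs.

    Each chunk of the input vector is multiplied by the matching block of rows
    of A; the partial output vectors are then summed (the consolidation step).
    """
    M = len(A[0])
    total = [0] * M
    for start in range(0, len(x), core_rows):
        rows = range(start, min(start + core_rows, len(x)))
        partial = [sum(A[j][k] * x[j] for j in rows) for k in range(M)]
        total = [t + p for t, p in zip(total, partial)]
    return total


A = [[j + k for k in range(3)] for j in range(10)]  # 10 inputs, 3 outputs
x = list(range(1, 11))
print(partitioned_vmm(A, x, core_rows=4))  # [330, 385, 440]
```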

In some embodiments of the invention, when long vectors are to be multiplied by a large matrix which cannot be represented at once by matrix 308, the order of multiplication is arranged to minimize the rate of matrix reload (change of values in matrix 308). As an example, when several complex value vectors are multiplied by a matrix, all the vectors are optionally first multiplied by a sub-matrix representing the real part of the matrix, then all the vectors are multiplied by a sub-matrix representing the imaginary part of the matrix, and the results are summed appropriately, as described above.
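The reload-minimizing order for complex vectors can be sketched as follows, using (A_re + iA_im)(x_re + ix_im) = (A_re x_re − A_im x_im) + i(A_re x_im + A_im x_re). Each matrix part is loaded once and all real and imaginary input parts are passed through it, so the matrix is loaded twice regardless of the number of vectors. The names are hypothetical, and the `loads` counter stands in for actual reloads of matrix 308.

```python
def matvec(R, x):
    """y[k] = sum_j R[j][k] * x[j] on the currently loaded (real) matrix."""
    return [sum(R[j][k] * x[j] for j in range(len(x))) for k in range(len(R[0]))]


def complex_batch_vmm(A_re, A_im, vectors):
    """Multiply complex vectors by A = A_re + i*A_im using only two matrix loads.

    Returns (results, number_of_matrix_loads).
    """
    loads = 0
    # Load 1: the real part of the matrix; pass every vector's real and imaginary parts.
    loads += 1
    re_re = [matvec(A_re, [z.real for z in v]) for v in vectors]
    re_im = [matvec(A_re, [z.imag for z in v]) for v in vectors]
    # Load 2: the imaginary part of the matrix.
    loads += 1
    im_re = [matvec(A_im, [z.real for z in v]) for v in vectors]
    im_im = [matvec(A_im, [z.imag for z in v]) for v in vectors]
    # Combine the four partial products per vector.
    results = [[complex(rr - ii, ri + ir)
                for rr, ri, ir, ii in zip(re_re[n], re_im[n], im_re[n], im_im[n])]
               for n in range(len(vectors))]
    return results, loads


A_re = [[1, 2], [0, 1]]
A_im = [[1, 0], [2, -1]]
vectors = [[1 + 2j, 3 - 1j], [2j, 1 + 1j]]
results, loads = complex_batch_vmm(A_re, A_im, vectors)
print(results, loads)
```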

Cross Talk Cancellation

In some embodiments of the invention, the attenuation values a(i,j) of the elements of matrix 308 are adjusted to compensate for inaccuracies in the directing of light by fan-out optics 306 and/or fan-in optics 310. Assuming accurate light handling by fan-out optics 306 and fan-in optics 310, the light P(i,j) impinging on each matrix element is equal to 1/N*V(i), where N is the number of elements in a column of matrix 308, and V(i) is the light intensity of VCSEL i. The light collected by each detector 312 is therefore given by:

${D(j)} = {{\sum\limits_{i}{{a\left( {i,j} \right)}{P\left( {i,j} \right)}}} = {\frac{1}{N}{\sum\limits_{i}{{a\left( {i,j} \right)}{V(i)}}}}}$

In some cases, the light V(i) from VCSELs 304 is distributed not only on the matrix elements of the i'th column but also, to a lesser degree, on neighboring columns. Optionally, the light actually impinging on each matrix element is assumed to have the form P′(i,j)=1/N*[(1−2c)V(i)+c*V(i−1)+c*V(i+1)], where c is a cross talk factor which is generally small relative to 1. Assuming accurate fan-in optics 310, the light collected by each detector 312 is given by:

${D^{\prime}(j)} = {{\sum\limits_{i}{{a\left( {i,j} \right)}{P^{\prime}\left( {i,j} \right)}}} = {{\frac{1}{N}\left( {1 - {2c}} \right){\sum\limits_{i}{{a\left( {i,j} \right)}{V(i)}}}} + {\frac{1}{N}c{\sum\limits_{i}\;{{a\left( {i,j} \right)}\left\lbrack {{V\left( {i - 1} \right)} + {V\left( {i + 1} \right)}} \right\rbrack}}}}}$

In some embodiments of the invention, the attenuation values of elements 311 are set to

$a'(i,j) = \frac{a(i,j)}{1-2c} - c\left[a(i+1,j) + a(i-1,j)\right]$, such that the resultant detected values are:

$D''(j) = \sum_{i} a'(i,j)\,P'(i,j) = D(j) + \frac{4c^{2}(1-c)}{1-2c}\cdot\frac{1}{N}\sum_{i} a(i,j)\left[V(i-1)+V(i+1)\right] - \frac{c^{2}}{N}\sum_{i}\left[a(i+1,j)+a(i-1,j)\right]\left[V(i-1)+V(i+1)\right] \qquad (1)$

using the equality

$\sum_{i}\left[a(i+1,j)+a(i-1,j)\right]V(i) = \sum_{i}\left[V(i-1)+V(i+1)\right]a(i,j)$,

in which a and V are taken as zero outside the matrix. As can be seen, the terms on the right side of equation (1) other than D(j), which represent the non-compensated portion of the optics inaccuracies, scale with the square of c and are therefore negligible.

In some embodiments of the invention, fan-in optics 310 transfers some of the light of a row j also to the detectors of rows above and/or below. Optionally, the attenuation values of matrix 308 are adjusted in order to compensate for such inaccurate light transfer. In an exemplary embodiment of the invention, the actual compensated attenuation values a′(i,j) of elements 311 are given as a function of the desired attenuation values a(i,j), by:

$a'(i,j) = b\,a(i,j) - c_1\,a(i+1,j) - c_2\,a(i-1,j) - c_3\,a(i,j+1) - c_4\,a(i,j-1) + d \qquad (2)$

in which b, c1, c2, c3, c4 and d are constants, which may be different for different i, j. In some embodiments of the invention, for simplicity, one or more of c1, c2, c3 and c4 are assumed to be zero and/or equal to another one of the values. In some embodiments of the invention, for simplicity, b, c1, c2, c3, c4 and/or d have the same values for all i and j. Alternatively, in order to achieve additional accuracy, the constants b, c1, c2, c3, c4 and/or d have values which depend on i and/or j.

In some embodiments of the invention, additional terms including coefficients of a(i+k, j+n) are used in calculating a′(i,j), where k, n are any integer values. Alternatively or additionally, one or more terms of a′(i,j) are taken as coefficients of other, already compensated, attenuation values. For example, the compensated attenuation values a′(i,j) may be calculated row after row, using previously calculated values from previous rows and/or previous columns, i.e., a′(i−k,j−n), in which at least one of k, n is positive and the other is non-negative.

In some embodiments of the invention, whenever possible, compensated attenuation values are used instead of non-compensated values. For example, the following equation, in which the already calculated values a′(i−1,j) and a′(i,j−1) replace a(i−1,j) and a(i,j−1), may be used instead of equation (2):

$a'(i,j) = b\,a(i,j) - c_1\,a(i+1,j) - c_2\,a'(i-1,j) - c_3\,a(i,j+1) - c_4\,a'(i,j-1) + d$

In an exemplary embodiment of the invention, the following equation is used:

$a'(i,j) = b(i,j)\left\{a(i,j) - c(k,m)\left[a(i+1,j) + a'(i-1,j)\right] - d(k,m)\left[a(i,j+1) + a'(i,j-1)\right]\right\}$

in which k=floor[i/16] and m=floor[j/16]. Thus, the b coefficient is calculated separately for each matrix element, while the c, d coefficients change in blocks of 16×16 elements. It is noted that other block sizes may be used.
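A sketch of the raster-order compensation in the exemplary equation, in which each a′(i,j) reuses the already computed a′(i−1,j) and a′(i,j−1). Treating neighbours outside the matrix as zero, the coefficient values and the function name are assumptions for illustration. The arguments `c_blk` and `d_blk` hold one coefficient per block of `block`×`block` elements.

```python
def compensate_raster(a, b, c_blk, d_blk, block=16):
    """Compensated attenuations in raster order, reusing already compensated neighbours.

    a'(i,j) = b(i,j) * {a(i,j) - c(k,m)*[a(i+1,j) + a'(i-1,j)] - d(k,m)*[a(i,j+1) + a'(i,j-1)]}
    with k = i // block and m = j // block. Neighbours outside the matrix count as 0.
    """
    rows, cols = len(a), len(a[0])
    ap = [[0.0] * cols for _ in range(rows)]

    def get(mat, i, j):
        return mat[i][j] if 0 <= i < rows and 0 <= j < cols else 0.0

    for i in range(rows):          # row after row ...
        for j in range(cols):      # ... so a'(i-1,j) and a'(i,j-1) already exist
            k, m = i // block, j // block
            ap[i][j] = b[i][j] * (
                a[i][j]
                - c_blk[k][m] * (get(a, i + 1, j) + get(ap, i - 1, j))
                - d_blk[k][m] * (get(a, i, j + 1) + get(ap, i, j - 1))
            )
    return ap


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 1.0], [1.0, 1.0]]
result = compensate_raster(a, b, c_blk=[[0.1]], d_blk=[[0.0]])
print(result)
```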

In some embodiments of the invention, the coefficients used in calculating the compensated attenuation values are determined from direct measurement of the inaccuracies in fan-in optics 310 and/or fan-out optics 306.

Alternatively or additionally, it is assumed that the cross talk can be represented by matrices T_in and/or T_out, such that the detected vector is $d' = T_{in}\,A\,T_{out}\,x$ instead of $d = A\,x$. A compensation matrix $A' = T_{in}^{-1}\,A\,T_{out}^{-1}$ is optionally used instead of the original matrix A, since $T_{in}\,A'\,T_{out}\,x = A\,x$.
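The compensation matrix can be sketched in plain Python as follows, using the d = Ax convention of this paragraph (rows of A are outputs). The small Gauss-Jordan inverse and the tridiagonal example cross-talk matrices are illustrative assumptions, not measured values.

```python
def matmul(X, Y):
    """Plain matrix product X @ Y."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def inverse(T):
    """Gauss-Jordan inverse with partial pivoting (adequate for small, well-conditioned T)."""
    n = len(T)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(T)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def compensation_matrix(A, T_in, T_out):
    """A' = T_in^-1 A T_out^-1, so that T_in A' T_out x = A x."""
    return matmul(matmul(inverse(T_in), A), inverse(T_out))


A = [[1.0, 0.5, 0.2], [0.3, 0.8, 0.4], [0.6, 0.1, 0.9]]
T_in = [[1.0, 0.04, 0.0], [0.04, 1.0, 0.04], [0.0, 0.04, 1.0]]   # illustrative fan-in cross talk
T_out = [[1.0, 0.05, 0.0], [0.05, 1.0, 0.05], [0.0, 0.05, 1.0]]  # illustrative fan-out cross talk
A_prime = compensation_matrix(A, T_in, T_out)
check = matmul(matmul(T_in, A_prime), T_out)   # should reproduce A
```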

Cross talk matrices T_in and/or T_out are optionally determined during calibration by measuring results of predetermined input/matrix/output combinations. In some embodiments of the invention, a set of cross talk measurement vectors of the form X(j)=δ(K−j), where δ(K−j) is zero unless K=j (e.g., X={0, 0, . . . , 1, 0, . . . }), are multiplied by a matrix A(j,k)=δ(K+1−j)·δ(M−k), having a single non-zero element, providing result vectors Y(k). The above described fan-out cross talk coefficient c(M, K+1) is optionally determined from the resultant Y(k). Similar measurements are optionally performed to determine the fan-in coefficients. More efficient determination methods, which require fewer test multiplications, may be used, in which a few non-zero values of X and A are used in some or all of the test multiplications.

In some embodiments of the invention, matrices T are simplified into a sparse form, such as {{1, d, 0, 0, . . . }, {d, 1, d, 0, 0, . . . }, {0, d, 1, d, 0, . . . }, . . . }, where all “d” may have different values which are small relative to 1, so as to allow for relatively simple inversion of the matrices T. Alternatively or additionally, iterative and/or recursive methods are used in inverting matrices T.
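For the sparse form above, one possible iterative inversion is a truncated Neumann series. This is an assumed choice, since the source does not specify a method. Writing T = I + E with small off-diagonal E, the iteration S ← I − E·S converges to T⁻¹. The sketch uses a uniform d for simplicity, and the function names are hypothetical.

```python
def tridiagonal(n, d):
    """Sparse cross-talk matrix {{1, d, 0, ...}, {d, 1, d, ...}, ...} with a uniform d."""
    return [[1.0 if i == j else (d if abs(i - j) == 1 else 0.0) for j in range(n)]
            for i in range(n)]


def neumann_inverse(T, terms=10):
    """Approximate T^-1 for T = I + E with small E by iterating S <- I - E*S.

    After `terms` iterations S = sum_{m=0..terms} (-E)^m, and T*S = I - (-E)^(terms+1),
    so the error shrinks geometrically when the entries of E are small.
    """
    n = len(T)
    I = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    E = [[T[i][j] - I[i][j] for j in range(n)] for i in range(n)]
    S = I
    for _ in range(terms):
        S = [[I[i][j] - sum(E[i][t] * S[t][j] for t in range(n)) for j in range(n)]
             for i in range(n)]
    return S


T = tridiagonal(5, 0.05)
S = neumann_inverse(T)
```

Each iteration only needs products with the sparse E, which is the property that makes the sparse form of T attractive.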

Alternatively or additionally to the attenuation values of matrix 308 compensating for inaccuracies in fan-out optics 306 and/or in fan-in optics 310, the attenuation values of matrix 308 are modified to compensate for irregularities in VCSELs 304, in detectors 312 and/or in other components of VMM core 204, including matrix 308 itself. As an example, if one light source in VCSELs 304 is stronger than other sources, the matrix elements in the column to which its light is spread may be made more opaque to compensate. In another example, if fan-in optics 310 is more efficient for the center columns and the efficiency declines towards the edges, matrix elements 311 are optionally made more transparent (for the same original matrix values) for the edge pixels. Optionally, the attenuation values of elements 311 of matrix 308 have a sufficient dynamic range, allowing compensation for a wide range of irregularities.

Alternatively or additionally, the compensation of the inaccuracies is performed by pre-processor 302 and/or post-processor 314. In some embodiments of the invention, the unit of VMM core 204 (e.g., pre-processor 302, matrix 308 and/or post-processor 314) used to compensate for each inaccuracy is selected according to the accuracy achieved by each of the elements. Optionally, when necessary, the compensation is performed by a plurality of different units of VMM core 204.

Optionally, when possible (e.g., when the accuracy achievable by different units is substantially the same), the compensation is performed by pre-processor 302. Alternatively, when possible, the compensation is performed by matrix 308. In some embodiments of the invention, when matrix 308 changes seldom, the compensation is performed by matrix 308, in order to reduce the number of times the compensation is performed. When matrix 308 changes often, the compensation is optionally performed by pre-processor 302, in order to minimize the amount of calculations required. Optionally, the selection is performed at production based on the intended use of SPE 102. Alternatively, a unit to perform the compensation is selected for each application.

In an exemplary embodiment of the invention, the cross-talk matrices are determined and/or tracked using methods described herein or in the other PCT application filed on even date, for tracking inter-channel interactions.

Redundancy

Assuming without loss of generality that the manipulated vectors include 256 elements, matrix 308 optionally includes an array of 256×256 elements 311. Alternatively, matrix 308 includes additional elements 311 beyond those required for performing the matrix multiplication (e.g., beyond 256×256). In some embodiments of the invention, matrix 308 includes one or more rows and/or columns beyond the number of elements in the multiplied vectors (e.g., 256).

Optionally, during operation, only the required number of rows and columns are used. In some embodiments of the invention, the redundant elements 311 are used instead of elements that are defective due to manufacturing or aging failures. Optionally, the rows and/or columns to be used are configured after manufacture. In these embodiments, the redundancy is optionally used to compensate for defective elements 311. If a matrix having x extra rows and y extra columns is provided, at least x+y defects in the matrix can be tolerated, by using one of the replacement rows/columns, in which there is no defect, in place of a row/column that has a defect. Defective matrix, detector and/or light source elements are optionally detected during calibration.
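A sketch of one way the x+y bound can be met, assuming defects are reported as (row, column) pairs during calibration. Each defect not already excluded retires its row while spare rows remain, and otherwise retires its column, so each defect consumes at most one spare. The greedy policy and the function names are hypothetical.

```python
def plan_replacements(defects, spare_rows, spare_cols):
    """Choose rows/columns to retire so that no defective element stays in use.

    Greedy: retire the row of an uncovered defect while spare rows remain,
    otherwise retire its column. Each defect consumes at most one spare, so up
    to spare_rows + spare_cols defects are always tolerated.
    """
    bad_rows, bad_cols = set(), set()
    for r, c in defects:
        if r in bad_rows or c in bad_cols:
            continue  # already excluded by an earlier choice
        if len(bad_rows) < spare_rows:
            bad_rows.add(r)
        elif len(bad_cols) < spare_cols:
            bad_cols.add(c)
        else:
            raise ValueError("defects exceed the available redundancy")
    return bad_rows, bad_cols


def active_indices(n_required, n_physical, retired):
    """First n_required physical indices that are not retired."""
    return [i for i in range(n_physical) if i not in retired][:n_required]


# 4x4 logical matrix on a 5x5 physical array (one spare row, one spare column).
rows_out, cols_out = plan_replacements([(2, 3), (0, 1)], spare_rows=1, spare_cols=1)
active_rows = active_indices(4, 5, rows_out)
active_cols = active_indices(4, 5, cols_out)
print(active_rows, active_cols)
```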

Alternatively or additionally, the rows and/or columns to be used at any specific time are set by a controller of VMM core 204, so that at different times different elements 311 are used. Optionally, the attenuation values of rows not in use may be updated while the other rows are being used for matrix multiplication. Thus, when changing the attenuation values of elements 311 is relatively time consuming, it may be performed without stopping the operation of VMM core 204.

In some embodiments of the invention, fan-out optics 306 and/or fan-in optics 310 are programmable, for example, using a controllable refracting optical element, to shift light from one optical path to another. Optionally, the light from each VCSEL may be directed to one of a predetermined number (e.g., 2-4) of adjacent rows of matrix 308. At any specific time, the light paths are optionally set to impinge on the specific matrix elements 311 which are to be used. Alternatively or additionally, extra light sources (VCSELs) 304 and/or detectors 312 are provided. Optionally, in this alternative, each VCSEL is assigned to a respective row of matrix 308. Further alternatively or additionally, an internal switching unit of VCSEL array 304 controls the electrical driving amplitudes applied to each of the VCSELs. In some embodiments of the invention, the switching speed of the values applied to specific VCSELs 304 is of the order of the time required to change the attenuation values of elements 311, as in many cases the switching between different VCSELs 304 is performed together with changes in attenuation levels of matrix 308.

In some embodiments of the invention, redundant detectors 312 are also provided to compensate for malfunctioning detectors. Alternatively or additionally, one or more extra light sources are provided to compensate for malfunctioning VCSELs 304.

In an exemplary embodiment of the invention, when redundant elements (e.g., VCSELs, matrix elements and/or detectors) are provided, at least some of the provided elements are produced with a relatively low quality, in order to limit costs. Alternatively or additionally, a larger percentage of redundant elements is provided for elements that are relatively cheap (e.g., detectors), while a low percentage of redundant elements (or no redundant elements at all) is provided for expensive elements (e.g., VCSELs). In an exemplary embodiment of the invention, only one or two extra VCSELs are provided due to their relatively high cost. Alternatively or additionally, the number of provided VCSELs is selected such that the probability of having fewer functional VCSELs than required is below a predetermined value.
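The VCSEL count criterion can be evaluated with a binomial model, assuming independent failures with a common per-device yield. The yield of 0.999 and the 1% threshold below are illustrative numbers, not values from the source, and the function names are hypothetical.

```python
from math import comb


def prob_shortfall(n, required, p_good):
    """Probability that fewer than `required` of `n` independent devices are functional."""
    return sum(comb(n, k) * p_good**k * (1 - p_good) ** (n - k) for k in range(required))


def min_devices(required, p_good, max_shortfall_prob):
    """Smallest device count whose shortfall probability is below `max_shortfall_prob`."""
    n = required
    while prob_shortfall(n, required, p_good) >= max_shortfall_prob:
        n += 1
    return n


print(min_devices(256, 0.999, 0.01))  # 258
```

With these assumed numbers, two spare devices are enough for 256 required VCSELs, which is in line with providing only one or two extra VCSELs.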

In some embodiments of the invention, all of VCSELs 304 and/or detectors 312 have the same operation parameters. Alternatively, one or more VCSELs 304 and/or detectors 312 have different parameters, for example, for use with matrix elements which require such different VCSELs 304 and/or detectors 312. Optionally, detectors 312 with different dynamic ranges are provided for use with matrix elements 311 having different properties. Alternatively or additionally, VCSELs 304 with higher intensities than others are provided for use with matrix elements having transparencies above normal. These options are especially advantageous in those embodiments in which VCSELs 304 and/or detectors 312 are provided for each element of matrix 308.

In some cases, columns and/or rows close to the edges of matrix 308 receive less light than other columns and rows due to off-axis inefficiency of the fan-in and/or fan-out optics. Optionally, mathematical matrix elements represented by elements at the edges of matrix 308 are represented by two or more elements, in order to enhance the accuracy and/or effective brightness of these elements. For example, each of one or more edge rows and/or columns may be duplicated. Pre-processor 302 optionally corrects for the duplication of columns, while post-processor 314 corrects for the duplication of rows. Alternatively, any other suitable correction methods may be used.

In some embodiments of the invention, one or more additional matrix columns and/or rows, and/or one or more extra detectors and/or VCSELs, are provided for on-line calibration and/or sanity checking. Optionally, at least one additional VCSEL 304 is provided for base line estimation, total energy measurement and/or other system analysis or monitoring. In an exemplary embodiment of the invention, four rows and/or columns are constantly set to a maximal attenuation value (e.g., 255), for calibration purposes. Alternatively or additionally, four columns and/or rows are set to a middle attenuation value, e.g., 128. The results of the on-line calibration are optionally used by pre-processor 302 and/or post-processor 314, as described above.

In some embodiments of the invention, matrix elements to which no light is intentionally projected, or from which no light is intentionally collected, are kept in a highest attenuation state, to reduce their potential noisy contribution.

Calibration

As described above, in some embodiments of the invention, adjustment parameters and/or error values are determined in one or more calibration processes. Optionally, the calibration process is performed after manufacture. Alternatively or additionally, a calibration process is performed every time the values of matrix 308 are changed and/or periodically at a predetermined rate (e.g., every 16 multiplication cycles) and/or after a predetermined number of matrix multiplications. Alternatively or additionally, calibration is performed for each batch of input vectors and/or for each matrix multiplication performed. Further alternatively or additionally, calibration is performed when the resulting processed data appears erroneous or inconsistent. Optionally, indications of the level of errors are received from an application receiving the processing results. In some embodiments of the invention, calibration is performed when the error level exceeds a predetermined level. Alternatively or additionally, the temperature of SPE 102 or parts thereof is monitored and a temperature dependent calibration is applied when the temperature changes significantly. Further alternatively or additionally, the calibration process is performed whenever matrix 308 is not in use.

In an exemplary embodiment of the invention, the calibration includes processing one or more known vectors in one or more different manners and comparing results between the vectors and/or to a known result. A particular calibration test that may be applied is calculating both a transform and a negative of the transform and adding the two results, to see if the total is zero. In some embodiments of the invention, a calibration process includes performing a same calculation under different conditions. During operation, the conditions are monitored and correction values are selected accordingly. Optionally, the specific correction value is selected based on an interpolation of the values determined in the calibration process. The different conditions may include different temperatures. As noted above with regard to history cancellation, temperature calibration may be used. In some cases, it is assumed that various effects maintain a steady state and/or are dependent on data settings and/or on averaged states of the matrix. To this end, a (relatively) small number of situations are executed on the processor, and distortions of the results are recorded and used to pre-correct the data and/or the matrix and/or to post-process the result, irrespective of the original cause.
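A sketch of two of the calibration steps above: selecting a correction value by interpolation between calibration temperatures, and the transform-plus-negative zero test. Linear interpolation, clamping outside the calibrated range, the tolerance parameter and all names are assumptions.

```python
from bisect import bisect_left


def interpolate_correction(temp, cal_temps, cal_values):
    """Linearly interpolate per-detector correction values recorded at calibration temperatures.

    `cal_temps` must be sorted ascending; outside the calibrated range the
    nearest table entry is used.
    """
    if temp <= cal_temps[0]:
        return list(cal_values[0])
    if temp >= cal_temps[-1]:
        return list(cal_values[-1])
    i = bisect_left(cal_temps, temp)
    t0, t1 = cal_temps[i - 1], cal_temps[i]
    w = (temp - t0) / (t1 - t0)
    return [(1 - w) * v0 + w * v1 for v0, v1 in zip(cal_values[i - 1], cal_values[i])]


def negation_check(y_transform, y_negated, tol):
    """Calibration test: a transform plus its negative should sum to (nearly) zero."""
    return all(abs(p + q) <= tol for p, q in zip(y_transform, y_negated))


cal_temps = [20.0, 40.0, 60.0]                     # illustrative calibration points
cal_values = [[1.0, 0.0], [1.2, 0.1], [1.6, 0.3]]  # e.g., gain and offset per point
print(interpolate_correction(30.0, cal_temps, cal_values))
```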

In some embodiments of the invention, a short testing procedure is performed periodically at a relatively high rate and a longer testing procedure is performed at a lower rate. Alternatively or additionally, along with each multiplication, one or more test procedures, e.g., use of one or more test columns, are performed.

In an exemplary embodiment of the invention, the calibration includes using an input vector having the same value for all the elements of the vector. Optionally, an average value is used for all the input elements, e.g., 128.

In an exemplary embodiment of the invention, calibration comprises performing a transform on dummy data (e.g., blank) to determine a correction matrix. The correction matrix is optionally applied by VPU 206 and/or by post-processor 314. Alternatively or additionally, the calibration results in changing the settings of the optical elements of VMM core 204, for example the transparencies of elements 311 and/or the light intensities of VCSELs 304. Alternatively or additionally, the results of the calibration are used to select redundant matrix, VCSEL and/or detector elements and/or to rearrange (or shift) the mapping of the elements.

In some embodiments of the invention, pre-processor 302 is adapted to generate values required for calibration. Optionally, pre-processor 302 is adapted to generate a vector formed of all ‘0’ bits, all ‘1’ bits and/or of a specific numeric value, such as 128 for 8-bit element vectors. Alternatively or additionally, pre-processor 302 is adapted to generate a sum vector in which each element is equal to the sum of i (the position in the vector) and V(i), the value of the input vector in position i. The sum vector is optionally used for static calibration of post-processor 314. In some embodiments of the invention, instead of calculating the sum vector (or other calibration values) by pre-processor 302, it is calculated by VPU 206.

Folded Path VMM Core

In some embodiments of the invention, VMM core 204 uses a folded optical path, for example in order to achieve a more compact structure for VMM core 204. The optical path is optionally folded along its length and/or its width.

FIGS. 3A and 3B are a schematic side view and a three-dimensional view, respectively, of an optical implementation 380 of VMM core 204, in accordance with an exemplary embodiment of the invention. In the embodiment of FIGS. 3A and 3B, a reflective attenuation matrix 382 is used. Reflective matrix 382 may be, for example, a GaAs/GaAlAs Multi Quantum Well (MQW) matrix. By using a reflective matrix 382, VMM core 204 is generally more compact and possibly easier to manufacture and/or package.

In optical implementation 380, VMM core 204 comprises an array of light sources (e.g., VCSELs 304) with relatively accurate polarization control. Methods of manufacturing VCSELs 304 with a relatively accurate polarization are described, for example, in EP patent publication 0,924,820, “Polarization-controlled VCSELs using externally applied Uniaxial Stress” and EP patent publication 0,935,321, “Surface Emission Semiconductor Laser”, the disclosures of which are incorporated herein by reference.

Each VCSEL 304 optionally has a respective lenslet 384 (in FIG. 3A, only one lenslet is shown; in FIG. 3B, a lenslet array is shown), which spreads the light from its VCSEL 304 along a plane perpendicular to the page in FIG. 3A. Optionally, a cylindrical and/or asymmetric lenslet 384 is used. The spread out light from VCSELs 304 is then optionally straightened by cylindrical lenses 386 into parallel light beams. The parallel light beams from a single light source hit a polarizing beam splitter (PBS) 388, which allows light in a first polarization to pass through and reflects light with the opposite polarization. As mentioned above, the light from VCSELs 304 is optionally polarized in a specific direction such that substantially all the light is reflected by PBS 388. The reflected light is directed toward reflective matrix 382, such that the light from a single VCSEL 304 is directed to a row of the reflective matrix. Each element of reflective matrix 382 reflects the light impinging on it, while absorbing a percentage of the light according to an adjustable attenuation level. The adjustable attenuation level represents a value of a mathematical matrix on which vector multiplication is performed, as described above with reference to FIGS. 2A and 2B. The light beams reflected from reflective matrix 382 are reflected back towards PBS 388.

In some embodiments of the invention, the light passing between PBS 388 and reflective matrix 382 passes through a λ/4 polarization changer 393, which shifts the polarization of the passing light. As the light passes through polarization changer 393 twice (on the way from PBS 388 and on the way back to PBS 388), the polarization of the light is switched, such that substantially all the light reflected by reflective matrix 382 passes through PBS 388, rather than being reflected by PBS 388. The light passing through PBS 388 is directed toward a detector array 390. The light directed toward detector array 390 is optionally fanned in by a cylindrical lens 392, such that all the light from a single column of matrix 382 reaches a single detector in array 390.

As an alternative to polarization-accurate light sources, light sources with a random but constant polarization may be used. The amount of light actually reaching each of the elements (according to the actual polarization of each light source) is optionally determined during calibration. The results are then corrected for the different percentages of light loss.
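One way to apply the calibration-based correction is sketched below in Python. It assumes that each source's polarization loss can be summarized by a single efficiency factor, measured once during calibration. The names `measure`, `TRUE_ETA`, `detectors` and the other helpers are hypothetical stand-ins for the hardware.

A source's loss affects its contribution to every column. The sketch therefore pre-scales the drive levels (a task the pre-processor could perform) rather than the detector outputs. It scales to the weakest source, so that no source is driven harder than requested.

```python
from typing import Callable, List, Tuple


def calibrate_efficiencies(measure: Callable[[int, float], float], n: int,
                           probe: float = 1.0) -> List[float]:
    """Estimate eta[i], the fraction of source i's light that survives the
    polarization-dependent path. `measure(i, probe)` is a hypothetical hook
    that drives only source i at power `probe`, with its element in one
    column fully transmitting, and returns that column's detector reading."""
    return [measure(i, probe) / probe for i in range(n)]


def compensate_inputs(x: List[float], eta: List[float]) -> Tuple[List[float], float]:
    """Scale drive levels so every source delivers eta_min * x[i].

    eta_min / eta[i] <= 1, so no source is driven above its requested level.
    Returns the drive levels and the gain to apply to the detector outputs."""
    eta_min = min(eta)
    drive = [xi * eta_min / e for xi, e in zip(x, eta)]
    return drive, 1.0 / eta_min


# --- Simulated hardware: fixed, unknown per-source efficiencies (hypothetical) ---
TRUE_ETA = [0.9, 0.6, 0.75]
T = [[0.5, 1.0], [0.2, 0.4], [1.0, 0.0]]  # element transmissions


def measure(i: int, probe: float) -> float:
    return TRUE_ETA[i] * probe


def detectors(drive: List[float]) -> List[float]:
    """Column sums seen by the detector array, including polarization loss."""
    return [sum(TRUE_ETA[i] * drive[i] * T[i][j] for i in range(len(T)))
            for j in range(len(T[0]))]


eta = calibrate_efficiencies(measure, len(TRUE_ETA))
x = [1.0, 0.5, 0.25]
drive, gain = compensate_inputs(x, eta)
corrected = [gain * v for v in detectors(drive)]
ideal = [sum(x[i] * T[i][j] for i in range(len(T))) for j in range(len(T[0]))]
```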

Further alternatively, light sources whose polarization changes over time are used. A detector array optionally determines the current loss of light due to polarization, and the results are corrected accordingly. Further alternatively or additionally, a scrambler is applied to the light such that it always has a 45 degree polarization. Half the power of the light is then lost, but since the loss is a fixed fraction it may be easy to compensate for. Alternatively, a circular polarization is used; for example, a clockwise (CW) or counterclockwise (CCW) polarization is used instead of a 45 degree polarization.
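The factor of one half follows from Malus's law for an ideal PBS. Here θ is the angle between the light's linear polarization and the PBS transmission axis, and z_j is the reading of detector j; both symbols are introduced only for illustration. A circularly polarized beam likewise splits equally into the two linear components, so the same constant correction applies:

```latex
\begin{aligned}
P_{\text{pass}} &= P_0 \cos^2\theta, \qquad P_{\text{refl}} = P_0 \sin^2\theta \\
\theta = 45^\circ:\quad P_{\text{refl}} &= P_0 \sin^2 45^\circ = \tfrac{1}{2} P_0 \\
z_j^{\text{corrected}} &= 2\, z_j^{\text{measured}}
\end{aligned}
```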

FIG. 4 is a schematic side view of an optical implementation 450 of VMM core 204, in accordance with another exemplary embodiment of the invention. Implementation 450 is generally similar to implementation 380 of FIGS. 3A and 3B, so the following description mainly discusses the differences. Implementation 450 uses two substantially identical reflective matrices 382 (marked 382A and 382B), an arrangement which eliminates the need for light sources with relatively accurate polarization control and/or stability. In implementation 450, an array of VCSELs 304 (or other light sources) generates light without polarization control. The light from VCSELs 304 is optionally spread out by lenslets 384 (for simplicity, only one is shown) and passed through a cylindrical lens 386 toward a PBS 388. PBS 388 optionally decomposes the light into its orthogonal polarizations. Some of the light (having a specific polarization) passes through PBS 388 and impinges on a row (perpendicular to the page of FIG. 4) of matrix 382B. Matrix 382B reflects the light back toward PBS 388 after attenuating each light beam according to its required attenuation. A polarization changer 393B optionally changes the polarization of the light passing between PBS 388 and matrix 382B, such that the light reflected from matrix 382B is reflected by PBS 388 toward a detector array 390.

The remaining portion of the light from each VCSEL 304, having the orthogonal polarization, is reflected by PBS 388 toward a row (perpendicular to the page of FIG. 4) of reflective matrix 382A. A polarization changer 393A optionally changes the polarization of the light passing between PBS 388 and matrix 382A, such that the light reflected from matrix 382A passes through PBS 388 toward detector array 390. In implementation 450, each mathematical matrix value is represented by two matrix elements: an element in matrix 382A and a respective element in matrix 382B.
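The claim that two identical matrices remove the need for polarization control can be checked numerically. In the Python sketch below, `p[i]` is a hypothetical, unknown fraction of source i's light that passes the PBS toward matrix 382B. Assuming ideal optics, each detector reading is a p-weighted blend of the two matrices. When the matrices are identical, that blend collapses to the plain product for any split.

```python
from typing import List

Matrix = List[List[float]]


def dual_matrix_vmm(x: List[float], p: List[float], TA: Matrix, TB: Matrix) -> List[float]:
    """Detector outputs for the two-matrix arrangement (idealized).

    p[i] is the unknown fraction of source i's light passing the PBS toward
    matrix 382B; the rest (1 - p[i]) is reflected toward matrix 382A. Both
    reflected portions reach the same detector array, column by column."""
    rows, cols = len(TA), len(TA[0])
    return [sum(x[i] * (p[i] * TB[i][j] + (1.0 - p[i]) * TA[i][j]) for i in range(rows))
            for j in range(cols)]


T = [[0.5, 0.25], [1.0, 0.75]]
x = [2.0, 1.0]
ideal = [sum(x[i] * T[i][j] for i in range(2)) for j in range(2)]   # [2.0, 1.25]
for split in ([0.1, 0.9], [0.5, 0.3], [1.0, 0.0]):
    out = dual_matrix_vmm(x, split, T, T)
    assert all(abs(a - b) < 1e-12 for a, b in zip(out, ideal))
```

If the two matrices differed, the result would depend on the unknown split. This is why the two matrices are programmed substantially identically.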

In some embodiments of the invention, for example as now described with reference to FIGS. 5A and 5B, beams with different polarizations are used to allow the same propagation space to be shared by two or more sets of paths.

FIG. 5A is a schematic side view of an optical implementation 452 of VMM core 204, in accordance with still another exemplary embodiment of the invention. In implementation 452, two half-size complementary matrices 454A and 454B are used instead of one full-size reflective matrix as in FIGS. 3A and 3B. The use of smaller reflective matrices 454 potentially increases the yield of their production process and therefore makes the production of VMM core 204 cheaper. For the following explanation it is assumed, without loss of generality, that the mathematical matrix of VMM core 204 has 256×256 elements and that each of reflective matrices 454A and 454B has 256×128 elements.

In implementation 452, polarization-controlled VCSELs 304 generate light beams, each of which is spread out by a lenslet 384 and cylindrical lens 386 into 128 parallel beams. The 128 parallel light beams are directed to PBS 388. The polarization setting of PBS 388, relative to the polarization of VCSELs 304, splits the light of each beam in half: a first half continues toward reflective matrix 454A and a second half is reflected toward reflective matrix 454B. Thus, the light of each VCSEL 304 reaches 256 elements, 128 on each of matrices 454A and 454B. In some embodiments of the invention, each row of the mathematical matrix is represented by 128 elements on matrix 454A and 128 elements on matrix 454B. In some embodiments of the invention, as described above, polarization changers 393A and 393B are located between PBS 388 and matrices 454A and 454B, respectively, such that the light reflected by the matrices is directed toward a pair of detector arrays 390A and 390B.

The light from PBS 388 headed toward detectors 390A and 390B may be viewed as an array of 256×128 parallel beams. Each beam includes light from two matrix elements, one on each of matrices 454A and 454B. It is noted, however, that the light from the two elements in each beam has different polarizations. The light beams are optionally fanned in by a cylindrical lens 392, which condenses each column of 256 beams into a single beam. Thus, after cylindrical lens 392 there are 128 beams, each formed by a pair of beams of different polarizations.

In some embodiments of the invention, a detector PBS 456 is used to separate the beams of different polarizations. It directs the light from matrix 454A toward detector array 390A and the light from matrix 454B toward detector array 390B. Each of detector arrays 390A and 390B optionally includes 128 detectors, for a total of 256 detectors.
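The arithmetic of implementation 452 can be sketched in Python on a matrix smaller than the 256×256 example. The sketch makes three assumptions:

- the first half of each mathematical row is written to matrix 454A and the second half to matrix 454B (the text does not fix the assignment);
- the PBS split is exactly 50/50;
- the optics are otherwise ideal.

The function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def split_columns(M: Matrix) -> Tuple[Matrix, Matrix]:
    """First half of every row -> matrix 454A, second half -> 454B (assumed)."""
    if len(M[0]) % 2:
        raise ValueError("column count must be even")
    half = len(M[0]) // 2
    return [row[:half] for row in M], [row[half:] for row in M]


def column_sums(x: List[float], M: Matrix) -> List[float]:
    """What one detector array reads: detector k sums column k of its matrix."""
    return [sum(x[i] * M[i][k] for i in range(len(M))) for k in range(len(M[0]))]


def half_matrix_vmm(x: List[float], M: Matrix) -> List[float]:
    A, B = split_columns(M)
    half_power = [xi / 2 for xi in x]          # 50/50 PBS split of every beam
    det_a = column_sums(half_power, A)         # detector array 390A
    det_b = column_sums(half_power, B)         # detector array 390B
    return [2 * v for v in det_a + det_b]      # undo the split, concatenate
```

Concatenating the two detector arrays reproduces the full set of 256 outputs. The sketch's factor of 2 corresponds to the known, constant loss caused by splitting each beam in half.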

Optionally, one matrix is used for real values and one for imaginaryvalues.
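The sketch below illustrates this option for a real input vector. It assumes that the real parts of a complex matrix are loaded on one reflective matrix and the imaginary parts on the other. It also assumes that signed values are handled elsewhere (e.g., by pre- and post-processing). The helper name is hypothetical.

```python
from typing import List


def complex_vmm_two_real_passes(x: List[float], M: List[List[complex]]) -> List[complex]:
    """Real input vector times complex matrix, using two real products.

    The real parts of M stand for one reflective matrix and the imaginary
    parts for the other; each yields one detector readout."""
    rows, cols = len(M), len(M[0])
    re = [sum(x[i] * M[i][j].real for i in range(rows)) for j in range(cols)]
    im = [sum(x[i] * M[i][j].imag for i in range(rows)) for j in range(cols)]
    return [complex(a, b) for a, b in zip(re, im)]
```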

FIG. 5B is a schematic side view of an optical implementation 458 of VMM core 204, in accordance with still another exemplary embodiment of the invention. Implementation 458 is similar to implementation 452, but places matrices 454A and 454B at an angle relative to PBS 388, so that the light beams from the matrices do not overlap. Thus, detector PBS 456 (FIG. 5A) is not required.

Reference is now made to FIGS. 6A-6D, which are schematic illustrations of elements of VMM core 204, in accordance with still another exemplary embodiment of the invention. For simplicity, FIGS. 6A-6D show a twelve-element vector example. It will be obvious to those skilled in the art that the embodiment of FIGS. 6A-6D may be modified to correspond to substantially any vector size.

FIG. 6A is a schematic side view of an optical implementation 472 of VMM core 204. In implementation 472, VCSELs 304 are organized in a two-dimensional array rather than a one-dimensional array. For example, each dimension may have more than one light source, such as two, four or more. FIG. 6B is a schematic illustration of an exemplary organization of VCSELs 304.

Each VCSEL 304 generates a light beam 303, which is directed through lenslets 356 and a PBS 388 to a respective portion of each of two reflective matrices 358 and 358′. Alternatively or additionally, light beams 303 may be directed to matrices 358 and 358′ through other elements, such as those described above with respect to other embodiments.

In some embodiments of the invention, the expanded light beams are passed through a collimator and/or a cropper (not shown) so as to form collimated beams.

FIG. 6C is a schematic illustration of reflective matrix 358. For each light beam 303, matrix 358 comprises an element array 399, which includes a plurality of attenuation elements 398 that absorb a programmed percentage of the light in the light beam. Each VCSEL 304, and its respective light beam, corresponds to a single element of the multiplied vector. Each element 398 optionally corresponds to a single element of the mathematical matrix represented by VMM core 204. Optionally, each element 398 corresponds to specific row and column indices identifying the mathematical matrix element that it represents. The elements 398 of a single array 399 optionally correspond to half the elements of a single row of the mathematical matrix. The other elements of the row are represented by a corresponding array 399 on matrix 358′. Each element 398 corresponds to

FIG. 6D is a schematic illustration of a detector array 320 in implementation 472. Detector array 320 optionally includes a number of detectors equal to the number of elements in the multiplied and resultant vectors. In some embodiments of the invention, detector array 320 is divided into two sub-arrays 327 and 327′ corresponding to matrices 358 and 358′, respectively. Optionally, each of sub-arrays 327 and 327′ has a number of detectors equal to, and organized as, the elements of element arrays 399. The reflected light beams from all of element arrays 399 of matrix 358 are optionally directed to sub-array 327, such that all the elements 398 corresponding to a single column of the mathematical matrix are collected by a single detector. In the same manner, the reflected light beams from all of element arrays 399 of matrix 358′ are optionally directed to sub-array 327′, such that all the elements 398 corresponding to a single column of the mathematical matrix are collected by a single detector.
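The bookkeeping of the twelve-element example can be sketched in Python. The sketch assumes that columns 0 to 5 of each row sit on matrix 358 and columns 6 to 11 on matrix 358′. It ignores the physical two-dimensional placement of the VCSELs: the placement changes where the beams land, but not which detector sums which elements. All names are hypothetical.

```python
from typing import List, Tuple

N = 12          # vector length in the FIG. 6 example
HALF = N // 2   # elements per element array 399 on each of matrices 358 and 358'


def element_location(row: int, col: int) -> Tuple[str, int, int]:
    """Map mathematical element (row, col) to (matrix, array 399, element).

    Assumed convention: the first HALF columns of each row live on matrix
    358, the rest on matrix 358'; the array index is the VCSEL (row) index."""
    if col < HALF:
        return "358", row, col
    return "358'", row, col - HALF


def detector_readout(x: List[float], M: List[List[float]]) -> List[float]:
    sub_327 = [0.0] * HALF     # fed by matrix 358
    sub_327p = [0.0] * HALF    # fed by matrix 358'
    for i in range(N):
        for j in range(N):
            matrix, _array, k = element_location(i, j)
            target = sub_327 if matrix == "358" else sub_327p
            # Detector k collects element k of every array 399 on its matrix,
            # i.e. one column of the mathematical matrix.
            target[k] += x[i] * M[i][j]
    return sub_327 + sub_327p
```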

In some embodiments of the invention, a prism element 310 and/or magnification and imaging lenslet arrays 330 are used to direct the light from the plurality of element arrays 399 to detector array 320.

Although element arrays 399 are shown in FIG. 6C as having a square shape, some embodiments of the invention use other shapes, such as a rectangular shape, a symmetrical polygon shape and/or a circular shape. The shape of element arrays 399 is optionally chosen so as to minimize the dispersion diameter of beams 303 directed onto the element arrays.

In some embodiments of the invention, elements 398 have a shape chosen according to the shape of element array 399 and the number of elements in array 399. For example, when arrays 399 have a square shape, elements 398 optionally have a rectangular shape, which easily spans the entire area of array 399. Optionally, the rectangular shape of elements 398 is as close as possible to a square, so as to minimize the border lengths between elements 398.
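Choosing the element layout closest to square amounts to picking the divisor pair of the element count nearest its square root. For an r×c grid filling a square array of side L, this choice minimizes the total internal border length (r + c − 2)·L. The Python sketch below uses hypothetical function names.

```python
import math
from typing import Tuple


def near_square_grid(n: int) -> Tuple[int, int]:
    """Return (rows, cols), rows <= cols, rows * cols == n, with rows and
    cols as close as possible, i.e. elements as close to square as possible."""
    if n < 1:
        raise ValueError("need at least one element")
    r = math.isqrt(n)
    while n % r:          # largest divisor not exceeding sqrt(n)
        r -= 1
    return r, n // r


def internal_border_length(n: int, side: float) -> float:
    """Total length of the borders between elements of an n-element grid
    tiling a square array of the given side length."""
    r, c = near_square_grid(n)
    return (r - 1) * side + (c - 1) * side
```

Consider the six elements of a half row in the twelve-element example. This gives a 2×3 layout with an internal border length of 3L, compared with 5L for a 1×6 strip.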

FIG. 6E is a schematic illustration of a reflective matrix, in accordance with another exemplary embodiment of the invention. In the embodiment of FIG. 6E, element arrays 399 have circular shapes, and elements 398 have circular-sector shapes. Alternatively, elements 398 in the form of concentric rings (and optionally a central circle) are used. Optionally, the widths of the rings are adjusted such that all of elements 398 have the same area.
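The equal-area condition fixes the ring radii. Consider n elements (a central circle and n − 1 rings) in a circle of radius R. The k-th boundary then lies at R·√(k/n), so the rings become narrower toward the edge. For circular sectors, the corresponding condition is simply equal angles of 2π/n. A short Python sketch follows, with a hypothetical function name.

```python
import math
from typing import List


def equal_area_ring_radii(R: float, n: int) -> List[float]:
    """Radii [r_0 = 0, r_1, ..., r_n = R] splitting a disk of radius R into a
    central circle plus n - 1 concentric rings, all of equal area.

    Element k spans r_k..r_{k+1} with area pi*(r_{k+1}^2 - r_k^2) = pi*R^2/n,
    which gives r_k = R * sqrt(k / n)."""
    if n < 1:
        raise ValueError("need at least one element")
    return [R * math.sqrt(k / n) for k in range(n + 1)]
```

For R = 1 and n = 4, the boundaries fall at 0, 0.5, about 0.707, about 0.866 and 1.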

In some embodiments of the invention, as described above, all of elements 398 have substantially the same area, so that equal mathematical values correspond to equal light intensities. Alternatively, elements 398 of different areas are used, for example to allow for better element location and/or easier production of the reflective matrices. Furthermore, elements 398 of different areas may be used to correct for border inefficiencies or other non-uniformity of the optical system. Similarly, in some embodiments of the invention, detectors having different areas are used. For example, detectors located at the ends of the light array may have larger areas to compensate for the lower light efficiency reaching them. In some cases, the elements and/or detectors are all manufactured to the same size and then selectively degraded in size and/or quality to promote uniformity of function.

Optionally, pre-processor 302 (FIG. 1) and/or post-processor 314 compensate for the different areas and light intensities. Alternatively or additionally, the attenuation values of elements 398 are adjusted to compensate for the different areas.
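The attenuation-based option is sketched below. It assumes that an element's light output is proportional to its area times its programmed transmission. Normalizing to the smallest element keeps every transmission at or below 1. The function name is hypothetical.

```python
from typing import List


def area_compensated_transmissions(values: List[float], areas: List[float]) -> List[float]:
    """Transmission to program into each element so that its light output,
    proportional to area * transmission, equals value * A_min.

    Using the smallest element as the reference keeps every transmission
    at or below 1, so every target remains physically realizable."""
    a_min = min(areas)
    out = []
    for v, a in zip(values, areas):
        if not 0.0 <= v <= 1.0:
            raise ValueError("values must be normalized to [0, 1]")
        out.append(v * a_min / a)
    return out
```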

It is noted that SPE 102 does not necessarily represent a square mathematical matrix. Rather, substantially any matrix size and shape may be used, with the numbers of elements of the input and resultant vectors adjusted accordingly (and not necessarily equal). Furthermore, in some embodiments of the invention, smaller and larger vector-matrix pairs may be handled, regardless of the maximal size of a mathematical matrix that VMM sub-system 199 can handle at once. For smaller vectors, the remaining vector elements are optionally set to zero and/or the corresponding light sources are shut off or used for calibration. The handling of larger vectors was described hereinabove.
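The Python sketch below shows zero-padding of a short vector. For oversized problems, it also splits the matrix into core-sized blocks and accumulates their partial column sums. This block decomposition is a generic tiling technique shown for illustration, not necessarily the method described earlier in the specification. `CORE = 4` stands in for the 256-element core to keep the example small, and all names are hypothetical.

```python
from typing import List

CORE = 4  # hypothetical core size; the text's example core is 256 x 256


def core_pass(x: List[float], M: List[List[float]]) -> List[float]:
    """Stand-in for one pass through a fixed CORE x CORE VMM core:
    out[j] = sum_i x[i] * M[i][j]."""
    assert len(x) == CORE and len(M) == CORE and all(len(r) == CORE for r in M)
    return [sum(x[i] * M[i][j] for i in range(CORE)) for j in range(CORE)]


def pad(v: List[float], n: int) -> List[float]:
    """Zero-pad v to length n (unused inputs contribute nothing)."""
    return list(v) + [0.0] * (n - len(v))


def tiled_vmm(x: List[float], M: List[List[float]]) -> List[float]:
    """Multiply a vector of any length by a matrix of any shape using only
    CORE x CORE passes, accumulating partial column sums."""
    rows, cols = len(M), len(M[0])
    if len(x) != rows:
        raise ValueError("vector length must match matrix rows")
    out = [0.0] * cols
    for r0 in range(0, rows, CORE):
        xb = pad(x[r0:r0 + CORE], CORE)
        for c0 in range(0, cols, CORE):
            block = [pad(M[r][c0:c0 + CORE], CORE) for r in range(r0, min(r0 + CORE, rows))]
            block += [[0.0] * CORE for _ in range(CORE - len(block))]
            partial = core_pass(xb, block)
            for j in range(min(CORE, cols - c0)):
                out[c0 + j] += partial[j]
    return out
```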

One property of some of the above-described embodiments is that SPE 102 is configurable. This allows for a device with multiple SPEs 102 that can be reconfigured and/or adapted to various situations. A particular example is described below, in which different SPEs 102 may be configured in different ways, depending on the processing that a base station is performing at the moment. In addition, the function of a single SPE 102 can be changed on the fly. For example, different functions can be evaluated without requiring previously processed data to be exported outside SPE 102.

In some embodiments of the invention, a processing system includes a plurality of SPEs 102 organized in series and/or in parallel. Alternatively or additionally, one or more SPEs 102 include a plurality of VMM sub-systems and/or VMM cores 204 which operate in series or in parallel.

Additional hardware implementations which may be used for electro-optical core 204 are described in one or more of the following PCT applications and publications, assigned to Lenslet Ltd. and JTC (2000) Inc.:

PCT/IL99/00479, published as WO 00/72267;

PCT/IL00/00283, published as WO 00/72104;

PCT/IL00/00285, published as WO 00/72107;

PCT/IL00/00286, published as WO 00/72108;

PCT/IL00/00284, published as WO 00/72106;

PCT/IL00/00282, published as WO 00/72105;

PCT/IL00/00671, published as WO 02/17329;

PCT/IL01/00331, published as WO 01/78261;

PCT/IL01/00333, published as WO 01/78011;

PCT/IL01/00334, published as WO 01/78012;

PCT/IL01/00332, published as WO 01/7773; and

PCT/IL01/00398, published as WO 01/84262.

The disclosures of all of these applications are incorporated herein by reference. In particular, some of these applications describe implementations for splitting, processing and/or collecting light, VMM architectures, and various useful optical components, such as matching filters and sign extractors. It is noted, however, that some of the above implementations are limited with respect to the type of transform that can be applied (e.g., FFT-derived). One particular implementation uses leaky and scattering light pipes to spread light from a point source at one end of a pipe along a line on the side of the pipe. It uses similar light pipes to collect light along the pipe and carry it to the pipe end. Other methods known in the art may be used as well. Another exemplary implementation uses a cylindrical lens to spread the light from a linear array over a 2D plane. Another implementation uses linear detector elements that detect light from a line of matrix elements. Various OVMM (optical vector matrix multiplier) devices have been described in the art, for example, the well-known Stanford OVMM.

Exemplary Op Code Set

In an exemplary implementation, a programming environment is provided for the SPE. In one example, an environment similar to the well-known MatLab® environment is used. Optionally, the commands are designed so that each command naturally decomposes into hardware commands for the various components of the SPE. Alternatively, a high-level compiler and/or optimizer may be provided. In an exemplary embodiment of the invention, seven data types are provided for use with the commands: Scalar, Complex scalar, Vector, Complex vector, Matrix, Complex matrix and Boolean vector.

1. Scalar—A real number with length of 1.

2. Complex scalar—A number that contains real and imaginary parts, with a length of 1.

3. Vector—Column vector that contains real numbers. Length is less than or equal to 256 (n≤256, n is even).

4. Complex vector—An even-length vector that contains real numbers. Length is less than or equal to 256 (n≤256, n is even). Optionally, the odd-indexed numbers are the real parts and the even-indexed numbers are the imaginary parts.

5. Matrix—Matrix that contains real numbers. Size is less than or equal to 256 (n,m≤256, n is even).

6. Complex matrix—Matrix that contains real numbers. Size is less than or equal to 256 (n,m<128). As noted above, the real parts may be arranged, for example, in odd columns or in odd rows.

7. Boolean vector—Column vector that contains real numbers. Length is less than or equal to 256. The real values used may depend on the implementation, for example as noted above.
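The interleaved Complex vector layout can be sketched in Python as follows. Python indexes from 0, so the text's odd (1-based) positions become even (0-based) positions. The helper names are hypothetical.

```python
from typing import List

MAX_LEN = 256  # maximal vector length given for the Vector types


def pack_complex(z: List[complex]) -> List[float]:
    """Interleave complex values into the Complex vector layout: with
    1-based indexing, odd positions hold real parts and even positions
    imaginary parts (0-based: index 2k real, 2k+1 imaginary)."""
    if 2 * len(z) > MAX_LEN:
        raise ValueError("complex vector would exceed 256 real elements")
    out: List[float] = []
    for c in z:
        out.extend((c.real, c.imag))
    return out


def unpack_complex(v: List[float]) -> List[complex]:
    """Inverse of pack_complex."""
    if len(v) % 2:
        raise ValueError("a Complex vector must have even length")
    return [complex(v[k], v[k + 1]) for k in range(0, len(v), 2)]
```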

The following table summarizes the provided instructions. Different functions have inputs and outputs of different data types. The table lists the inputs and outputs of every function in the simulator, each of which is optionally a hardware instruction (e.g., possibly including pre- and post-processing steps):

TABLE I

Function name | Module | Input type 1 | Input type 2 | Output type 1 | Output type 2
VmmMatrixLoad | VMM | Matrix | - | - | -
VmmVectorMult | VMM | Vector | - | Vector | -
Scalar2Vector | VPU | Scalar | - | Vector | -
CmplxScalar2CmplxVector | VPU | Scalar | - | Vector | -
VectorInsertDown | VPU | Vector | Scalar | Vector | Scalar
VectorInsertUp | VPU | Vector | Scalar | Vector | Scalar
VectorScalarAdd | VPU | Vector | Scalar | Vector | -
VectorScalarMult | VPU | Vector | Scalar | Vector | -
VectorScalarSub | VPU | Vector | Scalar | Vector | -
VectorShift | VPU | Vector | Scalar | Vector | -
ZeroVector | VPU | Scalar | - | Vector | -
VectorSign | VPU | Vector | - | Vector | -
VectorAbs | VPU | Vector | - | Vector | -
VectorMov | VPU | Vector | - | Vector | -
VectorVectorAdd | VPU | Vector | Vector | Vector | -
VectorVectorMult | VPU | Vector | Vector | Vector | -
VectorVectorSub | VPU | Vector | Vector | Vector | -
VectorLogicNot | VPU | Boolean vector | - | Boolean vector | -
VectorVectorCompEqual | VPU | Vector | Vector | Boolean vector | -
VectorVectorCompGreat | VPU | Vector | Vector | Boolean vector | -
VectorVectorCompLess | VPU | Vector | Vector | Boolean vector | -
VectorVectorConditionalChoose | VPU | Vector | Vector | Vector | -
CmplxVectorBuild | VPU | Vector | Vector | Complex vector | -
CmplxVectorConj | VPU | Complex vector | - | Complex vector | -
CmplxVectorMult_1i | VPU | Complex vector | - | Complex vector | -
CmplxVectorSeparate | VPU | Complex vector | - | Vector | Vector
CmplxVectorCmplxVectorMult_Imag | VPU | Complex vector | Complex vector | Complex vector | -
CmplxVectorCmplxVectorMult_Real | VPU | Complex vector | Complex vector | Complex vector | -
CmplxVectorImagVectorMult | VPU | Complex vector | Complex vector | Complex vector | -
CmplxVectorRealVectorMult | VPU | Complex vector | Complex vector | Complex vector | -

It should be noted that some instructions write their output to a global variable, and some instructions have only one input (i.e., a blank second input column). Also, the instruction "VectorVectorConditionalChoose" has a third input of type Boolean vector.

A short description of the instructions follows.

VmmVectorMult—Multiply Vector by Matrix.

-   Syntax Z=VmmVectorMult(X)
-   Operands X is Vector type. Vector elements are 8 bits.
-   Description Multiply Matrix by Vector. The Matrix is taken from Matrix memory. The result is multiplied by 2^(−Gain). Gain is taken from memory (stored when Matrix is loaded).
-   Mathematically Z=(Matrix*X)/2^Gain

VmmMatrixLoad—Load matrix.

-   Syntax VmmMatrixLoad(Matrix,Gain)
-   Operands Matrix is Matrix type. No output. Matrix elements are 8 bits.
-   Description Matrix is loaded into matrix memory. Matrix gain is stored in memory.

VectorVectorAdd—add two real vectors.

-   Syntax Z=VectorVectorAdd(X,xn,Y,yn,zn)
-   Operands X, Y and Z are Vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the second operand.
-   zn is the number of bits of the result.
-   Description X is added to Y (element-wise) and the result is stored in Z.

VectorVectorSub—subtract two real vectors.

-   Syntax Z=VectorVectorSub(X,xn,Y,yn,zn)
-   Operands X, Y and Z are Vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the second operand.
-   zn is the number of bits of the result.
-   Description Y is subtracted from X (element-wise) and the result is stored in Z.

VectorVectorMult—multiply two real vectors.

-   Syntax Z=VectorVectorMult(X,xn,Y,yn,zn)
-   Operands X, Y and Z are Vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the second operand.
-   zn is the number of bits of the result.
-   Description X and Y are multiplied (element-wise) and the result is stored in Z.

VectorScalarAdd—add real scalar to real vector.

-   Syntax Z=VectorScalarAdd(X,xn,Y,yn,zn)
-   Operands Y is Scalar type. X, Z are Vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the second operand.
-   zn is the number of bits of the result.
-   Description Y is added to all elements of X. The result is stored in Z.

VectorScalarSub—subtract real scalar from real vector.

-   Syntax Z=VectorScalarSub(X,xn,Y,yn,zn)
-   Operands Y is Scalar type. X, Z are Vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the second operand.
-   zn is the number of bits of the result.
-   Description Y is subtracted from all elements of X. The result is stored in Z.

VectorScalarMult—multiply real vector by real scalar.

-   Syntax Z=VectorScalarMult(X,xn,Y,yn,zn)
-   Operands Y is Scalar type. X, Z are Vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the second operand.
-   zn is the number of bits of the result.
-   Description All elements of X are multiplied by Y. The result is stored in Z.

VectorInsertUp—insert element into vector from the top.

-   Syntax [Z,Out]=VectorInsertUp(X,In,xn)
-   Operands In, Out are Scalar type. X, Z are Vector type.
-   xn is the number of bits of the first and second operands and of the results.
-   Description Shifts all elements down by one position, inserts a new element at the top of X and takes one element out of the bottom of the vector.

VectorInsertDown—insert element into vector from the bottom.

-   Syntax [Z,Out]=VectorInsertDown(X,In,xn)
-   Operands In, Out are Scalar type. X, Z are Vector type.
-   xn is the number of bits of the first and second operands and of the results.
-   Description Shifts all elements up by one position, inserts a new element at the bottom of X and takes one element out of the top of the vector.

VectorShift—arithmetic shift of vector elements.

-   Syntax Z=VectorShift(X,xn,N,yn)
-   Operands N is Scalar type, 8 bits. X, Z are Vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the result.
-   Description All elements of X are multiplied by 2^N. N can be negative, which means an arithmetic right shift. The result is stored in Z.

VectorSign—returns the sign of vector elements.

-   Syntax Z=VectorSign(X,xn)
-   Operands X, Z are Vector type.
-   xn is the number of bits of the first operand.
-   Description Returns 1 for positive numbers, −1 for negative numbers and 0 for zero. The result is therefore always 2 bits. The result is stored in Z.

VectorAbs—returns the absolute value of vector elements.

-   Syntax Z=VectorAbs(X,xn,yn)
-   Operands X, Z are Vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the result.
-   Description Returns Vector Z with the absolute value of each element of Vector X.

Scalar2Vector—copy scalar to vector.

-   Syntax Z=Scalar2Vector(X,xn)
-   Operands X is Scalar type, Z is Vector type.
-   xn is the number of bits of the first operand and of the result.
-   Description X is copied to every element of a vector of size 256.

CmplxScalar2CmplxVector—Complex Scalar to Complex Vector.

-   Syntax Z=CmplxScalar2CmplxVector(Xr,Xi,xn)
-   Operands Xr is a real-part Scalar type.
-   Xi is an imaginary-part Scalar type.
-   Z is Complex vector type.
-   xn is the number of bits of the operands and of the result elements.
-   Description Places the real-part and imaginary-part Scalars into a Complex Vector at the odd and even indexes, respectively, and duplicates them to obtain a 256-element Vector.

ZeroVector—reset vector.

-   Syntax Z=ZeroVector(xn)
-   Operands Z is Vector type.
-   xn is the number of bits of the result.
-   Description Z is set to zero, size 256.

VectorMov—copy vector.

-   Syntax Z=VectorMov(X,xn,yn)
-   Operands X, Z are Vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the result.
-   Description Copies all elements of X to Z.

CmplxVectorBuild—Complex vector constructor.

-   Syntax Z=CmplxVectorBuild(X,xn,Y,yn,zn)
-   Operands X, Y, Z are Complex vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the second operand.
-   zn is the number of bits of the result.
-   Description Builds a new Complex vector from X and Y, taking the odd-indexed elements (real part) from X and the even-indexed elements (imaginary part) from Y.
-   Mathematically Z=real(X)+j*imag(Y)

CmplxVectorSeparate—Construct real and imaginary vectors.

-   Syntax [X,Y]=CmplxVectorSeparate(Z,zn)
-   Operands X, Y, Z are Complex vector type.
-   zn is the number of bits of the operand and of the results.
-   Description Constructs two complex vectors:
-   The odd-indexed elements of X are copied from the odd-indexed elements of Z.
-   The even-indexed elements of X are set to zero.
-   The even-indexed elements of Y are copied from the even-indexed elements of Z.
-   The odd-indexed elements of Y are set to zero.
-   Mathematically X=real(Z); Y=imag(Z);

VectorLogicNot—logical NOT of vector elements.

-   Syntax Z=VectorLogicNot(X)
-   Operands X, Z are Boolean vector type.
-   Description Performs a logical NOT operation on every Boolean element in the vector.

VectorVectorCompEqual—element-wise equality comparison.

-   Syntax Z=VectorVectorCompEqual(X,xn,Y,yn)
-   Operands Z is Boolean vector type. X, Y are Vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the second operand.
-   Description Compares two vectors element by element. For every identical pair, puts 1 in the result vector, and 0 otherwise.

VectorVectorCompGreat—element-wise greater-than comparison.

-   Syntax Z=VectorVectorCompGreat(X,xn,Y,yn)
-   Operands Z is Boolean vector type. X, Y are Vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the second operand.
-   Description Compares two vectors element by element. If Xi is greater than Yi, puts 1 in the result vector, and 0 otherwise.

VectorVectorCompLess—element-wise less-than comparison.

-   Syntax Z=VectorVectorCompLess(X,xn,Y,yn)
-   Operands Z is Boolean vector type. X, Y are Vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the second operand.
-   Description Compares two vectors element by element. If Xi is smaller than Yi, puts 1 in the result vector, and 0 otherwise.

VectorVectorConditionalChoose—conditional vector constructor.

-   Syntax Z=VectorVectorConditionalChoose(X,xn,Y,yn,zn,Condition)
-   Operands Condition is Boolean vector type. X, Y, Z are Vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the second operand.
-   zn is the number of bits of the result.
-   Description Builds a new vector from X and Y according to the Condition Boolean vector: if Condition_i is 1, Zi=Yi; otherwise Zi=Xi.

CmplxVectorConj—complex vector conjugate.

-   Syntax Z=CmplxVectorConj(X,xn)
-   Operands X, Z are Complex vector type.
-   xn is the number of bits of the operand and of the result.
-   Description Conjugates each complex number by negating each even-indexed element.
-   Mathematically Z=conj(X)

CmplxVectorMult_1i—multiply by j.

-   Syntax Z=CmplxVectorMult_1i(X,xn)
-   Operands X, Z are Complex vector type.
-   xn is the number of bits of the operand and of the result.
-   Description Multiplies a vector by j: switches the odd- and even-indexed elements, then negates the odd-indexed elements.
-   Mathematically Z=j*X

CmplxVectorCmplxVectorMult_Real—real part of complex multiplication.

-   Syntax Z=CmplxVectorCmplxVectorMult_Real(X,xn,Y,yn,zn)
-   Operands X, Y, Z are Complex vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the second operand.
-   zn is the number of bits of the result.
-   Description Multiplies a Complex vector by a Complex vector and takes the real part of the result (equivalent to a Real instruction). The multiplication is done in three stages:
-   1. Element-wise multiply the 2 vectors.
-   2. Subtract each pair of companion elements (the even-indexed from the odd-indexed) and put the result in the odd-indexed place.
-   3. Set the even-indexed places of the result to zero.
-   Mathematically Z=Real(X.*Y)

CmplxVectorCmplxVectorMult_Imag—imaginary part of complex multiplication.

-   Syntax Z=CmplxVectorCmplxVectorMult_Imag(X,xn,Y,yn,zn)
-   Operands X, Y, Z are Complex vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the second operand.
-   zn is the number of bits of the result.
-   Description Multiplies a Complex vector by a Complex vector and takes the imaginary part of the result (equivalent to an Imag instruction). The multiplication is done in four stages:
-   1. Switch the real and imaginary elements in Y.
-   2. Element-wise multiply the 2 vectors.
-   3. Add each pair of companion elements and put the result in the even-indexed place.
-   4. Set the odd-indexed elements to zero.
-   Mathematically Z=Imag(X.*Y)

CmplxVectorRealVectorMult—complex multiplication by a real vector.

-   Syntax Z=CmplxVectorRealVectorMult(X,xn,Y,yn,zn)
-   Operands X, Y, Z are Complex vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the second operand.
-   zn is the number of bits of the result.
-   Description Multiplies a Complex vector by a Complex vector whose imaginary part is ignored (the even-indexed elements of the second vector are treated as zero). The multiplication is done in two stages:
-   1. In the second vector, copy the odd-indexed elements to the even-indexed places.
-   2. The result is the element-wise multiplication of the 2 vectors.
-   Mathematically Z=X.*Real(Y)

CmplxVectorImagVectorMult—complex multiplication by an imaginary vector.

-   Syntax Z=CmplxVectorImagVectorMult(X,xn,Y,yn,zn)
-   Operands X, Y, Z are Complex vector type.
-   xn is the number of bits of the first operand.
-   yn is the number of bits of the second operand.
-   zn is the number of bits of the result.
-   Description Multiplies a Complex vector by a Complex vector whose real part is ignored (the odd-indexed elements of the second vector are treated as zero). The multiplication is done in four stages:
-   1. In the second vector, copy the even-indexed elements to the odd-indexed places.
-   2. Element-wise multiply the 2 vectors.
-   3. Switch the companion elements.
-   4. Negate the odd-indexed elements.
-   Mathematically Z=X.*(j*imag(Y))
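The staged descriptions above can be checked against ordinary complex arithmetic with a short Python behavioral model. It follows the interleaved layout of the Complex vector type and ignores bit widths. The function names are hypothetical and are not the SPE's API.

```python
from typing import List

# Assumed layout (from the Complex vector type): 0-based index 2k holds
# Re(z_k) ("odd-indexed" in the 1-based text), index 2k+1 holds Im(z_k).


def cmplx_conj(x: List[float]) -> List[float]:
    """CmplxVectorConj: negate every imaginary (1-based even) element."""
    return [v if k % 2 == 0 else -v for k, v in enumerate(x)]


def cmplx_mult_1i(x: List[float]) -> List[float]:
    """CmplxVectorMult_1i: swap each real/imaginary pair, then negate the
    new real element, so (a, b) -> (-b, a), i.e. j*(a + jb)."""
    out: List[float] = []
    for k in range(0, len(x), 2):
        out.extend((-x[k + 1], x[k]))
    return out


def cmplx_mult_real(x: List[float], y: List[float]) -> List[float]:
    """CmplxVectorCmplxVectorMult_Real, following the three stages."""
    p = [a * b for a, b in zip(x, y)]            # 1: element-wise product
    out: List[float] = []
    for k in range(0, len(p), 2):
        out.extend((p[k] - p[k + 1], 0.0))       # 2: ac - bd; 3: zero imaginary slot
    return out


def cmplx_mult_imag(x: List[float], y: List[float]) -> List[float]:
    """CmplxVectorCmplxVectorMult_Imag, following the four stages."""
    ys: List[float] = []
    for k in range(0, len(y), 2):
        ys.extend((y[k + 1], y[k]))              # 1: swap re/im of y
    p = [a * b for a, b in zip(x, ys)]           # 2: element-wise product
    out: List[float] = []
    for k in range(0, len(p), 2):
        out.extend((0.0, p[k] + p[k + 1]))       # 3: ad + bc; 4: zero real slot
    return out
```

For example, take X = (1+2j, 3−j) and Y = (4−2j, 0.5+5j). The products are 8+6j and 6.5+14.5j. The real-part model therefore returns [8, 0, 6.5, 0], and the imaginary-part model returns [0, 6, 0, 14.5].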

As noted in the instruction descriptions, precision is limited. In an exemplary embodiment of the invention, commands are also provided for casting a value to a limited bit resolution and/or saturating it to an upper or lower limit when it is out of range.
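Casting with saturation, and the gain scaling of VmmVectorMult, can be modelled as below. The sketch relies on several assumptions not fixed by the text:

- values are signed two's-complement integers;
- the division by 2^Gain is an arithmetic right shift, which rounds toward minus infinity;
- Gain is non-negative;
- the output width is 16 bits.

The function names are also assumptions.

```python
from typing import List


def saturating_cast(values: List[int], bits: int) -> List[int]:
    """Clamp integers to the signed range of a `bits`-bit two's-complement
    word (an assumed representation)."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return [min(hi, max(lo, v)) for v in values]


def vmm_vector_mult(matrix: List[List[int]], x: List[int], gain: int,
                    out_bits: int = 16) -> List[int]:
    """Behavioral model of VmmVectorMult: Z = (Matrix * X) / 2^Gain.

    Vector and matrix elements are 8 bits, as stated for VmmVectorMult and
    VmmMatrixLoad; out_bits is a hypothetical output width."""
    if gain < 0:
        raise ValueError("gain is assumed non-negative")
    if any(not -128 <= v <= 127 for v in x) or \
       any(not -128 <= m <= 127 for row in matrix for m in row):
        raise ValueError("vector and matrix elements must fit in 8 bits")
    # Row-by-column product, then arithmetic right shift by the gain.
    z = [sum(m * v for m, v in zip(row, x)) >> gain for row in matrix]
    return saturating_cast(z, out_bits)
```

With an all-(−128) row and input vector, the raw sum 32768 exceeds the 16-bit range and saturates to 32767.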

In an exemplary embodiment of the invention, a simulation system is provided to try out commands and programs before actually programming the SPE. Optionally, such a simulator is written in MatLab. Such a simulator may be useful for calculating expected error values and/or for manually optimizing programs. In an exemplary embodiment of the invention, the simulation (or a real system) includes a profiler. The profiler keeps track of, for example, which commands were executed, their order, their relative occurrence and/or their precision. The data is optionally shown in graph form and/or in table form.
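A profiler of the kind described could be as simple as an ordered trace of operation index codes and their input and output precisions. Zero would mean "don't care", and occurrence counts and tables would be derived from the trace. The class and method names below are hypothetical. The sample rows reuse operation index codes from the profiling table in the text, without assuming which instruction each code denotes.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# (operation index code, input_1 bits, input_2 bits, output bits); 0 = don't care
Row = Tuple[int, int, int, int]


@dataclass
class Profiler:
    """Hypothetical profiler recording executed commands in order."""
    trace: List[Row] = field(default_factory=list)

    def record(self, op: int, in1_bits: int = 0, in2_bits: int = 0, out_bits: int = 0) -> None:
        self.trace.append((op, in1_bits, in2_bits, out_bits))

    def occurrences(self) -> Counter:
        """How many times each operation index code was executed."""
        return Counter(row[0] for row in self.trace)

    def max_output_bits(self) -> Dict[int, int]:
        """Widest output precision seen per operation code."""
        widest: Dict[int, int] = {}
        for op, _a, _b, out in self.trace:
            widest[op] = max(widest.get(op, 0), out)
        return widest

    def table(self) -> str:
        """Trace in execution order, one row per executed command."""
        lines = ["op  in1  in2  out"]
        lines += [f"{op:>2} {a:>4} {b:>4} {c:>4}" for op, a, b, c in self.trace]
        return "\n".join(lines)
```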

In an exemplary embodiment of the invention, the profiling information is stored in a table with the following format, with a zero precision representing "don't care":

Operation index code | Input_1 # of bits | Input_2 # of bits | Output # of bits
17 | 7 | 0 | 7
16 | 4 | 0 | 4
15 | 3 | 6 | 7
18 | 8 | 8 | 16

Exemplary Application

FIG. 7 is a schematic block illustration of a UMTS system 400 suitable for implementation using a module as described in FIG. 2. A plurality of cellular telephones 408 and 408′ communicate with a base station 402, which is connected to further system components 420 (not described in detail). Telephone 408 communicates with base station 402 via an uplink 412 and a downlink 414. However, in a typical situation, such as the presence of a nearby building 410, there is multi-path propagation. This is shown as a secondary uplink 418 and a secondary downlink 416 (two are shown; many more typically exist). Because the uplink and downlink may use different frequencies, the indirect paths are often different for the uplink and for the downlink.

Base station 402 includes an antenna 406 (shown generally connected) that optionally includes multiple component antennas (described below). A processing section of base station 402, which can include VMM processors such as SPE 102 (FIG. 1), comprises a receiver 422 for receiving transmissions from telephone 408, a decoder 424 for extracting the information carried by the transmission, a processor 426 that processes the data (e.g., collecting call statistics and power control) and communicates with further components 420, an encoder 428 that prepares data for transmission, and a transmitter 430 that generates signals to be sent to telephone 408.

In an exemplary embodiment of the invention, the VMM processors are used for processing in one or more of the above described components. In the methods described below, the matrix is the matrix portion of a VMM. Various other processing activities described may be performed, for example, by a VPU, a DSP, a host CPU or separate hardware/software.

In an exemplary embodiment of the invention, system 400 is a wide band CDMA system. In CDMA systems, each bit of data is spread out in time as a plurality of chips. Chip streams from different users share the same time and frequency space, but use pseudo-orthogonal codes to differentiate between the different users.

FIG. 8 is a schematic diagram of an information transmission and reception process 500, in accordance with an exemplary embodiment of the invention, and based on the UMTS standard. Different implementations may use different values than those described below. One or more bit-channels (e.g., speech and data) are encoded (502), optionally with added error correction information. If multiple channels are provided, they are interleaved (504). At 506, the data bits are spread, typically by Walsh sequences, to encode each bit as a series of chips. The length of the Walsh sequence used to spread the data is defined by a spreading factor. At the same time, a control bit stream is spread with a Walsh sequence, typically of a fixed length (e.g., 256), and the same Walsh sequence is used for all users. For voice, for example, one bit of data has one control bit associated with it, and some of the control bits are known pilot bits. Walsh encoding also separates multiple channels, if they exist, into orthogonal values. In one example, 1 bit of voice data is spread into 256 (complex-valued) chips, a spreading factor of 256. 1 bit of computer data, however, may be spread into 64 chips, a spreading factor of 64. Typically, a fixed chip rate is desired, and the spreading factor is thus determined by the ratio of the chip rate to the bit rate. The spreading factor of the control channel is typically fixed.
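The spreading step, and its reversal by correlation, can be sketched in Python as follows. The sketch assumes Walsh codes generated by the Sylvester Hadamard construction. UMTS channelization (OVSF) codes are the same set of sequences, taken in a tree order. Bits map as 0 → +code and 1 → -code, as described for decoding below. Scrambling, complex chips and noise are omitted, and the function names are hypothetical.

```python
def walsh_codes(sf):
    """Rows of a Sylvester-construction Hadamard matrix of order sf (a power of two)."""
    codes = [[1]]
    while len(codes) < sf:
        codes = [c + c for c in codes] + [c + [-x for x in c] for c in codes]
    return codes


def spread(bits, code):
    """Bit 0 -> +code, bit 1 -> -code."""
    chips = []
    for b in bits:
        sign = 1 if b == 0 else -1
        chips.extend(sign * c for c in code)
    return chips


def despread(chips, code):
    """Correlate each spreading-factor-long chip block with the code; the sign gives the bit."""
    sf = len(code)
    bits = []
    for k in range(0, len(chips), sf):
        corr = sum(x * c for x, c in zip(chips[k:k + sf], code))
        bits.append(0 if corr > 0 else 1)
    return bits


codes = walsh_codes(8)                      # spreading factor 8
voice, data = [0, 1, 1, 0], [1, 1, 0, 1]
# Two channels share the same chips; orthogonal codes keep them separable.
rx = [a + b for a, b in zip(spread(voice, codes[1]), spread(data, codes[2]))]
print(despread(rx, codes[1]), despread(rx, codes[2]))  # [0, 1, 1, 0] [1, 1, 0, 1]
```

Because different rows of the Hadamard matrix are orthogonal over a full block, the other channel contributes exactly zero to each correlation, and each channel is recovered without error.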

At 507, the chips are scrambled by multiplying each chip by an element of a (typically user-specific) pseudo-random sequence. Different sequences have low correlation (e.g., are orthogonal) with each other, so that sets of chips from different users can be separated at reception. Two types of sequences are in general use: long sequences (38400 chips) and short sequences (256 chips). The length of a sequence is the number of elements after which the sequence repeats itself.
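The scrambling step can be sketched as follows. A seeded pseudo-random generator stands in for the standard's code generator. This is an assumption, and the sketch does not reproduce the actual UMTS sequences. Sequence elements are limited to the unit-magnitude values 1, j, -1 and -j, so multiplying by the complex conjugate undoes the scrambling.

```python
import random

UNIT = [1, 1j, -1, -1j]


def scrambling_sequence(seed, length):
    """Hypothetical user-specific pseudo-random sequence of unit-magnitude chips."""
    rng = random.Random(seed)
    return [rng.choice(UNIT) for _ in range(length)]


def scramble(chips, seq):
    # The sequence repeats itself after len(seq) elements.
    return [c * seq[i % len(seq)] for i, c in enumerate(chips)]


def descramble(chips, seq):
    # |s| = 1, so s * conj(s) = 1 and the conjugate reverses the scrambling.
    return [c * seq[i % len(seq)].conjugate() for i, c in enumerate(chips)]


def normalized_correlation(a, b):
    return sum(x * y.conjugate() for x, y in zip(a, b)) / len(a)


s1 = scrambling_sequence(seed=1, length=256)   # a "short" 256-chip sequence
s2 = scrambling_sequence(seed=2, length=256)
chips = [1, -1, -1, 1] * 128                   # 512 chips, so the sequence wraps once
restored = descramble(scramble(chips, s1), s1)
print(all(abs(r - c) < 1e-12 for r, c in zip(restored, chips)))  # True
print(abs(normalized_correlation(s1, s1)))                       # 1.0
print(abs(normalized_correlation(s1, s2)))  # small: typically on the order of 1/16
```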

At 508-514, the chips are QPSK modulated, converted to analog, up-converted to RF frequencies (in some cases to IF frequencies), amplified and transmitted.

At 516-522, base station 402 receives a signal, amplifies it and generates a QPSK-demodulated stream of complex chip values. Optionally, the stream is oversampled, for example 8 times, so that various alignment processes, described below, can be applied more effectively.

At 524, a rake receiver, one example of which is described below, reconstructs the series of chips sent by each telephone 408. Then, the chips are decoded (526). In an exemplary embodiment of the invention, the reconstruction and decoding are performed for multiple users at the same time.

FIG. 9 is a schematic block diagram of an exemplary rake receiver 600, suitable, for example, for reference 524 of FIG. 8. In an exemplary embodiment of the invention, rake receiver 600 performs the following two types of tasks:

(a) extract data, including, for example, combining paths and separating channels, under conditions of interference between multiple paths of the same user and between multiple users; and

(b) find, lock onto and track multiple paths for the same user.

As noted above, a single user/base-station pair may be communicating over multiple paths. The multiple paths interfere with each other. In addition, if the base station is to track a single path for a user, that path should be reliable. In an exemplary embodiment of the invention, instead of tracking a single path, multiple paths for a single user are identified and tracked, and the contributions from reliable tracks are used to extract the data.

In an exemplary embodiment of the invention, the general behavior of receiver 600 is as follows. Receiver 600 first detects the relevant paths for a user. The paths are combined and corrected to form a composite path, and (possibly at a later stage) the data in the composite path signal is extracted. Prior to decoding, the composite path may be corrected again, for example, to correct phase shifts.

The data from demodulator 522 is processed for several users 602 in parallel. For a single user 602, data from demodulator 522 is passed to a searcher 604. Searcher 604 correlates the incoming signal with the pseudo-random chip series for each user, based on the knowledge that a “1” value pilot bit is available in the control portion of each transmission. The result is a series of peaks, each of which corresponds to a path of transmission of data from a user.

In an exemplary implementation, each row of a correlation matrix holds a set of chips for a single user, based on that user's pseudo-random code. The correlation is a running correlation over an interval that depends on the possible path delays, for example, proportional to the cell size and cell topology (e.g., the existence of reflective buildings). Each step can move one sample ahead (e.g., with 8 samples per chip), or a whole input vector less an optional overlap between vectors. It is noted, however, that detecting a correlation does not require comparing an entire chip set, although comparing an entire chip set does increase the probability of success and reduce error rates. In an exemplary embodiment of the invention, only part of the slot is searched: only a few bits are searched and the other bits are then tracked.
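A running correlation of this kind can be sketched in Python as repeated vector-matrix products, one per step. In the sketch, `vmm` is a plain software stand-in for the optical multiplier, with one matrix row per user. The test code is a 31-chip m-sequence, chosen because its periodic autocorrelation is 31 at zero lag and -1 elsewhere, which makes the peaks easy to read. The sketch also assumes that the code repeats periodically and that there is no noise. All names and path values are illustrative.

```python
def m_sequence(length=31):
    """±1 m-sequence from s[n+5] = s[n+2] XOR s[n] (polynomial x^5 + x^2 + 1)."""
    s = [1, 0, 0, 0, 0]
    while len(s) < length:
        s.append(s[-3] ^ s[-5])
    return [1 - 2 * b for b in s[:length]]


def multipath_signal(code, paths, length):
    """Noise-free sum of delayed, scaled copies of a periodically repeated code."""
    n = len(code)
    return [sum(a * code[(t - d) % n] for d, a in paths) for t in range(length)]


def vmm(matrix, vector):
    """Software stand-in for the VMM: one dot product per matrix row (user)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def running_correlation(signal, matrix, max_delay, step=1):
    """profile[user][i] is the correlation at a delay of i * step samples."""
    n = len(matrix[0])
    profile = [[] for _ in matrix]
    for d in range(0, max_delay + 1, step):
        for user, value in enumerate(vmm(matrix, signal[d:d + n])):
            profile[user].append(value)
    return profile


def find_peaks(values, threshold, step=1):
    """Delays whose correlation magnitude reaches the threshold (a crude peak detector)."""
    return [i * step for i, v in enumerate(values) if abs(v) >= threshold]


code = m_sequence()
rx = multipath_signal(code, paths=[(3, 1.0), (11, 0.6)], length=len(code) + 20)
profile = running_correlation(rx, [code], max_delay=20)
print(find_peaks(profile[0], threshold=10))  # [3, 11]
```

Each detected delay corresponds to one path, and a finger could then be assigned to it. The peak heights (30.4 and 17.6 here) track the path amplitudes.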

In an exemplary embodiment of the invention, the correlation searches over all the control bits in a frame and ignores data bits. Possibly, QPSK demodulator 522 outputs two streams of bits, I and Q, one of which is data and one of which is control. Alternatively or additionally, only known pilot bits are correlated. The various user chip sets, etc. are optionally stored in the memory of module 102.

As noted, each user may be transmitting on multiple paths. In an exemplary embodiment of the invention, each path is detected, tracked and possibly used. A finger 608 is assigned to each path. Each such finger includes a signal corrector 610 that corrects the phase and amplitude of the signal, a delay 612 that shifts the entire signal (by an integer number of sub-samples), and a weight 614 that assigns a measure of reliability to the path covered by the finger. The contributions from all the fingers of this user are added together at an adder 616 to provide a composite signal that will later be decoded.

In an exemplary embodiment of the invention, the correlation process is used to detect which paths have a significant signal. A measure of the reliability of a path may be derived by tracking how often the path is detected, over a frame or between frames. Paths with strong signals are selected to be covered by a finger. For example, 8 or 16 fingers per user may be provided. While some of this processing may use the DSP portion of module 102, in an exemplary embodiment of the invention, dedicated peak-finding hardware is used to detect correlation peaks and the VMM is used to perform the correlation operation. It should be noted that once a path is found, depending on the system assumptions and/or parameters, a second search or alignment may be performed only in the next frame, or within the same frame. In an exemplary embodiment of the invention, if a separate matrix is used for finger processing, the paths that are found to be stronger are used to update the profile of paths for the finger detector.

The resulting chip stream is then decoded, for example, to determine whether the data bit value is 1 or 0. First, however, the phase shift between the data and control bits caused by the radio transmission is corrected. This can be done by comparing the actual extracted pilot bits to the true values of the pilot bits (e.g., channel estimation). Typically, the phase shift between known pilot bits is relatively constant (e.g., Doppler shifting has a small effect over such intervals). The sign of the bit-combining result (e.g., the sum of all the fingers) determines the decoded bit value. This is the reverse of the Walsh spreading, in which the bit value defines the sign of the Walsh sequence. For example, a “0” value of the bit is converted into “1” multiplied by the Walsh sequence, and a “1” value of the bit is converted into “−1” multiplied by the Walsh sequence (i.e., an inverted sequence).
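The correction and the sign decision can be sketched as follows. The sketch assumes a single, constant complex channel coefficient over the slot. This coefficient is estimated by averaging the received pilot symbols against their known values. The function names and the 2-radian rotation in the example are illustrative.

```python
import cmath


def estimate_channel(rx_pilots, true_pilots):
    """Average of r * conj(p) over the known pilot symbols."""
    return sum(r * p.conjugate() for r, p in zip(rx_pilots, true_pilots)) / len(true_pilots)


def decide(symbols, h):
    """De-rotate by the estimated phase, then map a positive real part to bit 0."""
    derotate = h.conjugate() / abs(h)
    return [0 if (s * derotate).real > 0 else 1 for s in symbols]


h_true = 0.8 * cmath.exp(2.0j)   # channel: attenuation and a 2-radian phase shift
pilots = [1 + 0j] * 4            # known pilot symbols
bits = [0, 1, 1, 0]
tx = [1 if b == 0 else -1 for b in bits]
rx_pilots = [h_true * p for p in pilots]
rx_data = [h_true * s for s in tx]

h = estimate_channel(rx_pilots, pilots)
print(decide(rx_data, h))        # [0, 1, 1, 0]
print(decide(rx_data, 1 + 0j))   # without correction: [1, 0, 0, 1]
```

With a phase shift of more than 90 degrees, the uncorrected sign decision inverts every bit. This is why the phase is corrected before the sign is taken.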

FIG. 10 is a flow diagram 700 for a rake receiver and decoder, in accordance with an exemplary embodiment of the invention, in which a hexagon represents a step of partial or complete matrix update. At 702 a search matrix is loaded, for example, with two rows for each user: one for the imaginary values of the pseudo-random sequence and one for the real values (e.g., control and data bits, respectively). The two rows may be, for example, consecutive, or a block of real rows may be followed by a block of (e.g., all) the imaginary rows. In an exemplary embodiment of the invention, the matrix holds the pseudo-random sequence and the VCSELs are modulated by the Walsh code. As different spreading factors may have different Walsh sequence lengths, it is more convenient to Walsh-modulate the input rather than the matrix. The matrix is optionally sub-sampled to match the input. At 704, the matrix is applied to some or all of the data in the frame, yielding a plurality of paths. For example, one bit may be searched for each nine bits that are tracked. At 706 a new phase-correction matrix is input, and then at 708 a phase correction is determined that best corrects the data to separate the control bits (e.g., pilot bits) from the data bits. In an exemplary embodiment of the invention, the matrix used for finger detection is constructed according to the results of the searching phase and the spreading factor that is indicated in the control bits. Generally, the data's spreading factor is known at the end of the frame. At this stage, only that spreading factor is used in the matrix for a single-bit correlation (e.g., for some spreading factors, part of the matrix is unused, or is used for redundancy and/or for improving the signal-to-noise ratio). Users with the same spreading factor may be grouped together in the same matrix.

In one embodiment of the invention, where long codes are used, the matrix values are updated as the input progresses along the frame: for each new bit interval, a different segment of the long code is loaded into the matrix.

At 710, a finger-specific matrix is loaded, with codes generated by a code generator 714 (e.g., if long codes are used and the search does not cover the whole frame). The delays, etc. are optionally embodied in the values inserted into the matrix and in their shifting, for example, assuming a short shift (e.g., <32 chips). It should be noted that once the phase is determined, sub-sampling of the data is not required in all embodiments, and the sub-sampling may be removed, for example, by interpolation or averaging. At 712, decoding for multiple fingers is applied, as noted above, using, for example, path corrections as determined at 704. Code generator 714 generates further codes for later parts of the frame.

In an exemplary embodiment of the invention, long codes are used when the base station is servicing a low density of users. However, once high capacity is required, short codes and multi-user detection methods, described below, may be used.

The above description of a rake receiver has focused on the case where a single spreading factor is used. However, this is not always the case. For example, two or more spreading factors may coexist. In this case, finger decoding comprises adding together the contributions from fewer than 256 chips for a single bit. In an exemplary embodiment of the invention, only a subset of the matrix row elements is used. Alternatively, if the users cannot be arranged so that each matrix handles only a single spreading factor (and, in other cases, codes of the same length), some of the processing for those cases may be performed on a different matrix. Alternatively, the VCSEL values and matrix values are duplicated, to provide a stronger signal. Alternatively or additionally, only the matrix values are duplicated.

In one example of a low spreading factor, such as 4, there are 64 bits in 256 chips. By arranging the matrix in a triangular way, more than one bit can be decoded at a time. It should be noted, however, that if a low spreading factor is used, the number of users is generally reduced. This is because the total capacity (bits/sec) of a cell is typically interference-limited, and there is a trade-off between the data rate and the power: a power increase compensates for a rate increase (e.g., fewer chips per bit). The factor of increase is called the coding gain. Therefore, if one or more users are using a small spreading factor, then many other potential users are not active, and their resources can be used for decoding more than one bit in the same matrix.
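One possible reading of the multi-bit arrangement is sketched below, under the assumption that it uses a block-diagonal layout rather than a strictly triangular one. In this layout, row k of the matrix holds the short code in the columns of the k-th bit and zeros elsewhere, so a single vector-matrix product yields one correlation per bit. `vmm` is again a software stand-in for the optical multiplier, and the code and bit values are illustrative.

```python
def vmm(matrix, vector):
    """Software stand-in for the VMM: one dot product per matrix row."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def multi_bit_matrix(code, n_chips):
    """Row k holds 'code' in chips k*sf .. (k+1)*sf - 1 and zeros elsewhere."""
    sf = len(code)
    rows = []
    for k in range(n_chips // sf):
        row = [0] * n_chips
        row[k * sf:(k + 1) * sf] = code
        rows.append(row)
    return rows


code = [1, -1, 1, -1]                                    # spreading factor 4
bits = [(k * 7) % 3 % 2 for k in range(64)]              # arbitrary test pattern
chips = [(1 if b == 0 else -1) * c for b in bits for c in code]  # 256 chips
matrix = multi_bit_matrix(code, len(chips))              # 64 rows x 256 columns
decoded = [0 if v > 0 else 1 for v in vmm(matrix, chips)]
print(len(matrix), decoded == bits)                      # 64 True
```

In this layout, a single pass over a 256-chip input decodes all 64 bits. With a spreading factor of 256, the same pass would decode only one bit per row.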

DSP 214 is optionally used for choosing only a few paths from all the peaks that the peak detector has detected and/or for calculating phase rotation from the quadrature components.

The decoding operation is a matched filter process (filtering) performed on the bits being decoded. The combining is done by aligning the different paths of filtered data and summing them using a Maximum Ratio Combining (MRC) method, in which the contribution of each path is multiplied by its amplitude and the appropriate weight. In an exemplary embodiment of the invention, each path has an associated DSP process, which assigns it a weight depending on the reliability of the path. Unreliable paths are dropped, and newly found reliable paths are added and assigned a weight. Any weight assignment method, for example as known in the art, may be used.

Alternatively, Equal Gain Combining is used, in which all path contributions are weighted equally. To generate the matrix for this phase, the users are divided into groups according to their spreading factor. Each such group can have its own matrix. The rows of the matrix will have the length of the spreading factor, and one bit at a time will be decoded. This operation is repeated separately for each spreading factor (e.g., for all 7 possible spreading factors, 4 . . . 256). Alternatively, the users are grouped so that only some of the matrices need to be constructed and used. In addition, more than one bit can be decoded per cycle, as explained above.

In an exemplary embodiment of the invention, searching for a correlation of an input vector with a small pattern comprises loading the matrix so that each line includes the pattern, shifted by some amount (the arrangement may be non-monotonic, to reduce cross-talk). The result indicates where in the input the best match lies. A peak detector is optionally applied to the result vector to determine this location. In an exemplary embodiment of the invention, when the input vector is longer than the device input, a partial overlap may be provided between consecutive input vector sections, for example about or more than the pattern size. Alternatively, a shorter overlap or no overlap is provided.
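The pattern search can be sketched in Python as follows. For simplicity, the rows here are in monotonic shift order. Consecutive sections overlap by one sample less than the pattern length. This is an assumed choice: it is the smallest overlap that guarantees every placement of the pattern is examined. The function names and the test signal are illustrative.

```python
def shifted_pattern_matrix(pattern, width):
    """Row s holds the pattern starting at column s (monotonic order, for simplicity)."""
    p = len(pattern)
    rows = []
    for s in range(width - p + 1):
        row = [0] * width
        row[s:s + p] = pattern
        rows.append(row)
    return rows


def search(signal, pattern, width):
    """Return (score, position) of the best match of 'pattern' in a long 'signal'.

    The signal is cut into device-width sections that overlap by len(pattern) - 1
    samples, so that every placement of the pattern falls inside some section.
    """
    p = len(pattern)
    matrix = shifted_pattern_matrix(pattern, width)
    hop = width - (p - 1)
    last = len(signal) - p            # last valid start position
    best = (float("-inf"), -1)
    for start in range(0, last + 1, hop):
        section = signal[start:start + width]
        section = section + [0] * (width - len(section))  # pad the final section
        scores = [sum(m * x for m, x in zip(row, section)) for row in matrix]
        for s, score in enumerate(scores):                 # simple peak detector
            if start + s <= last and score > best[0]:
                best = (score, start + s)
    return best


pattern = [1, -1, 1, 1, -1]
signal = [0] * 60
signal[37:42] = pattern
print(search(signal, pattern, width=16))  # (5, 37)
```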

As the number of users increases, the interference between users becomes significant. In addition, a single user may have many more paths than can practically be modeled using a multi-finger rake receiver. Neighboring cells are another cause of interference; however, this interference is considered to be lower than the intra-cell interference. It can be cancelled in the same way as the intra-cell interference (e.g., using MUD, described below) if the neighboring cells transmit their data bits (e.g., using an Iur connection).

In an exemplary embodiment of the invention, a MUD (multi-user detection) method is used to remove the contribution of interfering signals, so that the desired signal is more discernible. In an exemplary embodiment of the invention, the MUD method is used to determine the best correlation for each user on the data bits, rather than on the control bits. After the best correlations are determined, multi-finger detection may be applied as above.

Three MUD algorithms are commonly used, although many have been suggested and are contemplated as being suitable for some embodiments of the invention.

A first method is based on maximum likelihood. The signal is estimated based on channel estimation (e.g., delay, phase, amplitude and/or Doppler shift), and the possible combinations of data bits are tested to determine the best match to the input signal. This method is generally considered very complex.

A second method is decorrelation detection, in which the signal is modeled as the product of a cross-correlation matrix and the data bits, with noise added. If the noise is ignored, the data bits can be reconstructed by inverting the cross-correlation matrix and multiplying the input signal by the inverse, or by solving a set of equations.
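For illustration, the synchronous case can be written as y = R A b + n. Here y holds the matched-filter outputs, R is the cross-correlation matrix, A is the diagonal matrix of user amplitudes and b holds the transmitted ±1 symbols; this notation is introduced for the example only. The sketch below drops n, solves R x = y by Gaussian elimination and takes signs. The three-user R and the near-far amplitudes are made-up values, chosen so that the plain matched filter fails for the weak user.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def sign(v):
    return 1 if v > 0 else -1


R = [[1.0, 0.6, 0.3],
     [0.6, 1.0, 0.5],
     [0.3, 0.5, 1.0]]
amps = [0.3, 1.0, 2.0]           # user 0 is weak (near-far situation)
b = [1, -1, -1]
y = [sum(R[i][j] * amps[j] * b[j] for j in range(3)) for i in range(3)]

print([sign(v) for v in y])            # matched filter: [-1, -1, -1] (user 0 wrong)
print([sign(v) for v in solve(R, y)])  # decorrelator:  [1, -1, -1]
```

Once n is included, the inverse also scales the noise, and the noise is amplified when R is ill-conditioned. This noise enhancement is the usual cost of the decorrelating approach.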

A third method is an iterative process that eliminates interference-contributing paths by subtracting an estimate of each user's signal. In an exemplary embodiment of the invention, multiple interfering paths are removed in parallel. The value of a single bit can be estimated, for example, by subtracting the effect of the previous, current and next bits of a plurality of paths.

In an exemplary embodiment of the invention, a plurality of strong signals are detected in the input signal, for example using a detection matrix as described above. After detection, the signal value is decoded, the temporal delay is determined and the signal strength is estimated. All of these have been described above with reference to a general rake receiver, except for signal strength, which can be estimated directly from the extracted signal. The estimated signals are then aligned in time, scaled to the estimated amplitude and subtracted from the input signal. Ideally, the result is as if the interfering signals had never been present. It should be noted that a plurality of signals can be estimated and subtracted in parallel. Alternatively or additionally, the VMM processor may be used to generate the estimated signals in parallel, by multiplying the data estimates of each signal by the appropriate spreading functions. Amplitude correction may be achieved, for example, using a VPU operation.

In an exemplary embodiment of the invention, when an interfering data bit cannot be estimated with a desired level of confidence, either the chips relating to that bit are not subtracted from the input signal, or only a fraction of the bit value is subtracted. The fraction may be related to the confidence level.
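A single stage of such parallel cancellation is sketched below at the matched-filter (symbol) level rather than at the chip level. The confidence measure is a hypothetical choice: |y_j| / amplitude, clipped at 1. Symbols whose confidence falls below a threshold are not subtracted at all. R, the amplitudes and the symbols are made-up illustrative values.

```python
def sign(v):
    return 1 if v > 0 else -1


def pic_stage(R, amps, y, min_conf=0.2):
    """One stage of parallel interference cancellation on matched-filter outputs y.

    Each user's tentative symbol sign(y_j) is scaled by a confidence in [0, 1]
    (here |y_j| / amps[j], clipped at 1); symbols below min_conf are not
    subtracted at all. All users are cleaned in parallel from the same estimates.
    """
    n = len(y)
    estimates = []
    for j in range(n):
        conf = min(1.0, abs(y[j]) / amps[j])
        weight = conf if conf >= min_conf else 0.0
        estimates.append(weight * amps[j] * sign(y[j]))
    return [y[k] - sum(R[k][j] * estimates[j] for j in range(n) if j != k)
            for k in range(n)]


R = [[1.0, 0.6, 0.3],
     [0.6, 1.0, 0.5],
     [0.3, 0.5, 1.0]]
amps = [0.3, 1.0, 2.0]
b = [1, -1, -1]
y = [sum(R[i][j] * amps[j] * b[j] for j in range(3)) for i in range(3)]
cleaned = pic_stage(R, amps, y)
print([sign(v) for v in y], [sign(v) for v in cleaned])  # [-1, -1, -1] [1, -1, -1]
```

After one stage, the weak user's bit is recovered, because the estimates of the two strong users are reliable and are subtracted in full.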

In an alternative exemplary embodiment of the invention, a matrix approach is used to establish and then solve the equations of the second (decorrelation) method. In another alternative embodiment, the iterative method is used; however, at each iteration, a plurality of paths (e.g., 128) are removed.

In an exemplary embodiment of the invention (using the second method), the multi-user detection comprises the following steps:

(a) multipath combining, for example using MRC;

(b) calculation of cross-correlation coefficients and matched filter results (for example, using an oversampling of 4);

(c) solving the resulting set of equations, for example using an iterative, non-stationary method such as CG, CGS, BiCGSTAB or GMRES; and

(d) optionally verifying the results using a DSP.
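Step (c) can be illustrated with the plain conjugate gradient (CG) method. CG assumes a symmetric positive-definite matrix, whereas CGS, BiCGSTAB and GMRES also handle non-symmetric systems. In the sketch, the repeated matrix-vector product is the operation that would map onto the VMM. The 3×3 system is a made-up example whose exact solution is (0.3, -1, -2).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    # In the module, this product is the step a VMM would perform.
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive-definite A."""
    x = [0.0] * len(b)
    r = [float(v) for v in b]     # residual b - A x, with x = 0
    p = r[:]
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


R = [[1.0, 0.6, 0.3],
     [0.6, 1.0, 0.5],
     [0.3, 0.5, 1.0]]
y = [-0.9, -1.82, -2.41]           # matched-filter outputs for scaled symbols 0.3, -1, -2
print([round(v, 6) for v in conjugate_gradient(R, y)])  # [0.3, -1.0, -2.0]
```

In exact arithmetic, CG converges in at most n iterations for an n×n system, and each iteration needs only one matrix-vector product.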

In an exemplary embodiment of the invention, MUD methods are applied only to the significant bits, while the insignificant bits are set to zero. This may improve the accuracy and/or reduce error propagation, especially the magnification of the effect of Gaussian noise.

FIG. 11 is a schematic data flow diagram 800 of a rake receiver including MUD, in accordance with an exemplary embodiment of the invention. Portions 702-708 are the same as in the simple rake receiver. However, the finger detectors are integrated into the MUD method. At 802 a matrix for calculating cross-correlation is loaded, and the cross-correlation factors are calculated (804). The matrix is switched so as to calculate multiple cross-correlation coefficients.

At 806, a matrix for calculating a matched filter is loaded. Values for the matched filter are calculated using a code generator 808. This process is repeated until all the desired matched filters are calculated. At 812, matrices for solving a set of simultaneous equations are loaded. The equations are solved at 816, with the help of a code generator 814 for loading the equations. As noted above, this process is iterative.

In some cases, two paths (of the same user or of different users) have a significant delay between them (e.g., of the order of the length of a bit). If detection is applied to two such users in the same matrix operation, an input vector that contains a whole bit for one user may contain contributions (possibly equal ones) from two bits of the other user. However, the delay for each user/path is generally tracked. In an exemplary embodiment of the invention, detection is applied to an artificial signal that contains the contributions of fractional bits. For example, an input signal may be correlated, for a particular user, with a signal that contains the second half of the code for a “0” value and the first half of the code for a “1” value. For example, six different signals may be used for each user: “01”, “00”, “10” and “11”, with the dividing line determined by the tracking, and, in some cases, “0” and “1”, into which the pairs degenerate. Optionally, only “00” and “01” are used, as each of the other two signals has a correlation of the opposite sign to one of these for any particular case.
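The fractional-bit templates can be sketched as follows. `offset` is a hypothetical parameter giving the tracked position of the bit boundary within the window. The window then holds the tail of the code for one bit, followed by the head of the code for the next bit. Only the “00” and “01” templates are correlated; a negative result selects the complementary pair. The code values are illustrative.

```python
def template(code, offset, first_bit, second_bit):
    """Window that starts 'offset' chips into a bit: the tail of the code for
    the first bit followed by the head of the code for the next bit."""
    s1 = 1 if first_bit == 0 else -1
    s2 = 1 if second_bit == 0 else -1
    return [s1 * c for c in code[offset:]] + [s2 * c for c in code[:offset]]


def detect_pair(window, code, offset):
    """Correlate with the "00" and "01" templates only; "11" and "10" are their negatives."""
    best_pair, best_corr = None, 0.0
    for pair in ((0, 0), (0, 1)):
        corr = sum(w * t for w, t in zip(window, template(code, offset, *pair)))
        if best_pair is None or abs(corr) > abs(best_corr):
            best_pair, best_corr = pair, corr
    if best_corr < 0:                   # opposite sign: the complementary pair
        best_pair = (1 - best_pair[0], 1 - best_pair[1])
    return best_pair


code = [1, 1, -1, 1, -1, -1, 1, -1]     # spreading factor 8 (illustrative)
bits = [0, 1, 1, 0]
chips = [(1 if b == 0 else -1) * c for b in bits for c in code]
offset = 3                              # tracked position of the bit boundary
for k in range(len(bits) - 1):
    start = k * len(code) + offset
    print(detect_pair(chips[start:start + len(code)], code, offset))
# (0, 1)
# (1, 1)
# (1, 0)
```

With an offset of 0, the template reduces to the ordinary one-bit code, which matches the degenerate “0” and “1” case.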

In some cases, it is difficult to detect fractional parts of a bit. In an exemplary embodiment of the invention, the correlation is applied to pairs of consecutive bits, using double-length input vectors (e.g., vectors that contain enough chips to show two bits). Each correlation optionally overlaps the previous one by part of a bit or by a whole bit.

This detection method may be used for regular detection rather than forMUD.

A further optional feature is a smart antenna controller, in which the antenna gain is directional. In the uplink channel, cell capacity is limited mainly by interference between users. In the downlink channel, the limitation is caused by the limited transmission power. It has apparently been proposed in the past to use some type of smart antenna in environments with a relatively small number of highly interfering users. CDMA, however, is an environment with a large number of low-interference users. In an exemplary embodiment of the invention, it is recognized that users with a low spreading factor (e.g., high data rate users) introduce a disproportionate amount of interference. In an exemplary embodiment of the invention, low data rate users are separated using MUD and high data rate users are separated using a smart antenna scheme. The number of actual antenna components used in an antenna can be derived, for example, from the required angular separation. Different components may have different beam shapes.

In an exemplary embodiment of the invention, the process of applying a smart uplink antenna comprises receiving the signal from multiple antenna components and selectively applying a fractional gain to antenna components for a particular high-interference source, while retaining a gain for the direction of the source. As a result, when the contributions from the components are combined, the signals of the low-interference users are not swamped by the high-interference user. Three types of selective gain may be applied: null steering, fixed beams (e.g., beam forming without zeros) and beam shaping (e.g., beam forming with zeros, spatial filtering). In an exemplary embodiment of the invention, null steering is used for the uplink. It is noted that the base station generally knows the identity and number of the interfering signals.
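Null steering can be illustrated with one common textbook construction. Under the assumption of a uniform linear array with half-wavelength spacing and a single interferer, the weight vector is the desired direction's steering vector with its component along the interferer's steering vector removed. The combined response toward the interferer is then zero, while nearly full gain is kept toward the desired user. The function names, angles and element count are illustrative. With several interferers, the projection would be onto the orthogonal complement of all their steering vectors.

```python
import cmath
import math


def steering(theta_deg, n_elements):
    """Steering vector of a half-wavelength-spaced uniform linear array."""
    s = math.sin(math.radians(theta_deg))
    return [cmath.exp(1j * math.pi * m * s) for m in range(n_elements)]


def inner(u, v):
    """u^H v"""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def null_steering_weights(desired_deg, interferer_deg, n_elements):
    """Desired steering vector with its component along the interferer removed."""
    a_d = steering(desired_deg, n_elements)
    a_i = steering(interferer_deg, n_elements)
    k = inner(a_i, a_d) / inner(a_i, a_i)
    return [d - k * i for d, i in zip(a_d, a_i)]


def gain(w, theta_deg):
    """Magnitude of the combined array response w^H a(theta)."""
    return abs(inner(w, steering(theta_deg, len(w))))


w = null_steering_weights(desired_deg=10, interferer_deg=-35, n_elements=8)
print(gain(w, 10), gain(w, -35))  # close to 8 toward the user, ~0 toward the interferer
```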

FIG. 12 is a schematic data flow diagram 900 for a rake receiver implementation that includes a smart antenna and MUD, in accordance with an exemplary embodiment of the invention. At 902 a filter bank for weight determination is loaded. The values are determined, for example, based on the known high-interference users. These users may be identified, for example, when the connection process starts, with the connection parameters for each connection sent to the searcher (e.g., in receiver 600) before the data transfer begins (e.g., in the signaling phase of the connection/service setup). In an exemplary embodiment of the invention, the weights are estimated based on pilot bit detection and are updated at a low rate. At 904 the weights for the antenna segments are calculated. At 906 a new matrix is loaded for correlation purposes, so that at 908 a spatial search of the input from the antenna segments is performed (e.g., first correlation, then peak detection). At 910 a filter bank is loaded, so that a filtering function is performed at 912. The resulting signals can now be combined and processed using a MUD method, for example using elements 802-816, as in FIG. 11. Alternatively, the regular rake receiver process of correlating the multiple paths and combining finger results is continued, as described above.

In a downlink antenna, the signals for the users are multiplied by a weighting matrix to determine the weights for each antenna segment, to provide beam forming. The transmission power is multiplied by the number of antenna elements (e.g., each element has its own power amplifier). In an exemplary embodiment of the invention, the location of each user is determined using well-known direction finding methods. Optionally, these methods use module 102 to determine the direction for multiple users and/or paths in parallel, for example by using module 102 to solve multiple simultaneous equations. In an exemplary embodiment of the invention, the direction of a user is found by sending the user a secondary spreading factor that is different for each antenna lobe, and then determining which return path the user is listening on by identifying the spreading factor. The return direction is the direction of the uplink as a whole. It can be used to better aim the antenna for the downlink (e.g., by determining which beam direction the user responded to) and for the uplink (e.g., by determining which direction to listen to for that user).

An additional processing task that may be performed by module 102 is the generation of multiple streams of encoded and spread data, in parallel, for multiple downlink channels.

It should be appreciated that, using the methods described above, such as MUD and smart antennas, a cell can have a higher capacity. Alternatively or additionally, the cell can have spare capacity to handle calls from nearby cells. An additional potential advantage is better power control, due to more frequent monitoring of the actual power received from users and/or to be transmitted to users. Another potential advantage is higher capacity, due to more effective channel estimation.

In implementing a system using a VMM processor, the computations can be distributed in various ways between the computational components (e.g., VMM, VPU and DSP). While a variety of distribution methods are contemplated, in an exemplary embodiment of the invention, processes that can be redefined as VMM operations are performed by the VMM. In general, as the VMM itself is massively parallel, it may be preferable to perform more operations, albeit in parallel. Any remaining balance can be handled by the DSP, for example, using methods well known in the art. It should be noted that a typical UMTS implementation includes many components that may not be implemented using a VMM, for example, code generators and symbol reprocessors.

It should be noted, in addition, that some of the methods described herein may also be applied in non-VMM systems, for example, fractional MUD estimation reduction.

Furthermore, the terms row and column were used herein in describing specific operations on the matrix. It is noted that rows and columns may be interchanged by a simple change in the optics setup and/or the matrix arrangement.

The present invention has been described using non-limiting detailed descriptions of embodiments thereof that are provided by way of example and are not intended to limit the scope of the invention. In particular, some of the exemplary numerical figures, for example, sizes (e.g., of the matrix or of the input), accuracy and/or precision, are derived from numbers currently associated with non-finalized standards and can change, for example, if the standards change or depending on the implementation. In addition, the implementation may include various degrees of distribution of processing components. Further, even in a real-time system, and especially in a non-real-time system, various calculations (e.g., calibration) may be performed on-line or off-line. The electronic circuits may be, for example, hardware, software and/or firmware. It should be understood that features and/or steps described with respect to one embodiment may be used with other embodiments, and that not all embodiments of the invention have all of the features and/or steps shown in a particular figure or described with respect to one of the embodiments. Variations of the embodiments described will occur to persons of the art.

It is noted that some of the above described embodiments may describe the best mode contemplated by the inventors and therefore include structure, acts or details of structures and acts that may not be essential to the invention and which are described as examples. Structure and acts described herein are replaceable by equivalents which perform the same function, even if the structure or acts are different, as known in the art. Therefore, the scope of the invention is limited only by the elements and limitations as used in the claims. When used in the following claims, the terms “comprise”, “include”, “have” and their conjugates mean “including but not limited to”.

1. An integrated VMM (vector-matrix multiplier) module, comprising: an electro-optical VMM component that multiplies an input vector by a matrix to produce an output vector; an electronic VPU (vector processing unit) that digitally processes digital data comprised in at least one of the input and output vectors; and a memory that stores at least one of matrix replacement values, at least one previous output vector and instructions for a component of the module.
 2. A module according to claim 1, comprising a DSP (digital signal processor) that processes at least one of the input vector, the output vector or the matrix.
 3. A module according to claim 2, wherein at least one of said DSP and VPU are programmed to calculate an update value for at least part of said matrix.
 4. A module according to claim 1, wherein said VMM component includes a local memory buffer for update values of said matrix.
 5. A module according to claim 1, comprising a register file adapted for exchanging information between said VMM and said VPU.
 6. A module according to claim 5, wherein said register file includes a register copy ability for transferring information between registers.
 7. A module according to claim 1, comprising a parameter extractor which extracts at least one parameter from at least one of said vectors.
 8. A module according to claim 7, wherein said parameter comprises an extreme value element.
 9. A module according to claim 1, wherein said VMM module comprises a pre-processor which preprocesses said input vector, to improve a quality of its processing by a matrix component of said VMM.
 10. A module according to claim 1, wherein said VMM module comprises a pre-processor which preprocesses said input vector, to correct for artifacts caused by processing by a matrix component of said VMM.
 11. A module according to claim 1, comprising a vector buffer for buffered input from an external circuit.
 12. A module according to claim 11, wherein said buffer receives 8 bit data in parallel.
 13. A module according to claim 1, comprising a controller which controls the operations of the VMM component and the electronic VPU.
 14. A module according to claim 13, wherein the VMM component and the electronic VPU have operation cycles and the controller controls the operations of the VMM component and the electronic VPU in each operation cycle based on commands it receives.
 15. A module according to claim 14, comprising controller memory which stores the commands of the controller.
 16. A module according to claim 14, wherein the commands received by the controller include a field for each of the VMM component and the electronic VPU.
 17. A module according to claim 14, wherein the commands received by the controller include a field for program flow commands.
 18. A module according to claim 14, wherein the commands indicate for the VPU one or more of a precision to be used by the VPU, a method of overflow treatment and a parameter of truncation.
 19. A module according to claim 14, wherein the commands are generated by compiling high-level language directives.
 20. A module according to claim 1, formed as part of at least one of the following: a cellular telephone signal processor; WBCDMA (Wide Band Code Division Multiplexing Access) processor; an automated face recognition system; an XDSL (Digital Subscriber Line) modem; an OFDM (Orthogonal Frequency Division Multiplexing) processor; a VDB Wireless Broadcast processor; a GSM cellular communication system; an EDGE (2.5 G) cellular communication system; a packet processor; a router; a switch; a compression protocol processor; a decompression protocol processor; a JPEG processor; an MPEG processor; an MP3 processor; a CELP/LPC voice processor; a spectrum analyzer; a machine vision processor; and a correlation engine.
 21. An integrated VMM (vector-matrix multiplier) module, comprising: an electro-optical VMM component that multiplies an input vector by a matrix to produce an output vector; and a controller, wherein said controller is operative to replace values in only a part of said matrix.
 22. A method of improving signal detection in an electro-optical VMM, comprising: receiving an input vector and a matrix to be processed by said VMM; and rearranging said input vector on an input of said VMM and said matrix in a matrix portion of said VMM, in a manner that improves signal detection.
 23. A method according to claim 22, wherein rearranging comprises spatially separating vector elements to reduce cross-talk.
 24. A method according to claim 22, wherein rearranging comprises duplicating at least some vector elements.
 25. A method according to claim 24, wherein rearranging comprises duplicating an entire vector.
 26. A method according to claim 24, wherein rearranging comprises rearranging said matrix.
 27. A method according to claim 24, wherein rearranging comprises rearranging said input vector so that at least some light sources of said VMM can be extinguished.
 28. A method of improving signal detection in an electro-optical VMM, comprising: receiving an input vector and a matrix to be processed by said VMM; and adapting values of at least one of said input vector on an input of said VMM and said matrix in a matrix portion of said VMM, in a manner that improves signal detection.
 29. A method according to claim 28, wherein adapting comprises negating values of at least some vector elements.
 30. A method according to claim 28, wherein adapting comprises shifting a baseline value to be non-zero, such that light sources of the VMM are not extinguished to achieve this baseline value.
 31. A method according to claim 28, wherein adapting comprises amplifying or reducing input values to make use of an available dynamic range of said VMM.
 32. A method according to claim 28, wherein adapting comprises shifting an input value baseline to make use of an available dynamic range of said VMM.
 33. A method according to claim 28, wherein adapting comprises applying a linearity correction.
 34. A method according to claim 28, wherein adapting comprises weighting vector elements with weights that correspond to a number of zero values in a corresponding matrix column.
 35. A method according to claim 28, wherein adapting comprises weighting at least one of vector elements and matrix elements with weights that correspond to an average of values in corresponding matrix columns.
 36. A method of improving signal detection in an electro-optical VMM, comprising: receiving an input vector and a matrix to be processed by said VMM; processing said vector by said VMM to produce an output vector; and adapting values of said output vector, by applying a history correction which corrects for residual effects of a previous computation performed by said VMM.
 37. A method according to claim 36, wherein said adapting comprises applying a temperature correction.
 38. A method according to claim 36, wherein said adapting comprises applying a correction for an adaptation made to said input vector. 
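The input adaptations of claims 30 to 32 and the output correction of claim 38 fit together through linearity: if each input element is mapped to s·x + b, then M(s·x + b·1) = s·(Mx) + b·(M·1), so the unadapted result is recovered by subtracting b times the matrix row sums and dividing by s. The sketch below is a minimal illustration under stated assumptions, not the claimed implementation. The names `adapt_input`, `matvec` and `correct_output`, and the `floor`/`ceiling` intensity limits, are hypothetical. `matvec` merely stands in for an ideal, noise-free optical multiply.

```python
def adapt_input(x, floor, ceiling):
    """Affinely map x into [floor, ceiling] (hypothetical modulator limits).

    A non-zero floor keeps every light source lit (cf. claim 30), and the
    scaling spreads the values over the available range (cf. claims 31-32).
    Returns the adapted vector together with the scale s and offset b used.
    """
    if not 0 < floor < ceiling:
        raise ValueError("require 0 < floor < ceiling")
    lo, hi = min(x), max(x)
    scale = (ceiling - floor) / (hi - lo) if hi > lo else 1.0
    offset = floor - scale * lo
    return [scale * v + offset for v in x], scale, offset


def matvec(matrix, x):
    """Ideal matrix-vector product, standing in for the optical VMM."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def correct_output(y_adapted, matrix, scale, offset):
    """Undo the input adaptation on the output vector (cf. claim 38).

    Uses M(s*x + b*1) = s*(M x) + b*rowsum(M), solved for M x.
    """
    return [(y - offset * sum(row)) / scale for y, row in zip(y_adapted, matrix)]
```

For example, with M = [[1, 2], [3, 4]] and x = [-1, 3], `adapt_input(x, 0.1, 1.0)` yields [0.1, 1.0], and correcting the product of M with that adapted vector recovers M·x = [5, 9]. A real device would also need the history and temperature corrections of claims 36 and 37, which this sketch does not model.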